
Readiness probe failed #130

Closed
morotsgurka opened this issue Oct 4, 2024 · 23 comments

@morotsgurka

Hi!

I'm running k3s on 3 nodes together with Longhorn and FluxCD.
immich-chart version: 0.8.1

I found an issue on the immich repo with a similar problem, but the cause there seemed to be that postgresql was not enabled, which I have enabled.

This is my values file:

env:
  DB_PASSWORD:
    valueFrom:
      secretKeyRef:
        name: immich-postgresql-secret
        key: password
  DB_DATABASE_NAME:
    valueFrom:
      secretKeyRef:
        name: immich-postgresql-secret
        key: database
  DB_USERNAME:
    valueFrom:
      secretKeyRef:
        name: immich-postgresql-secret
        key: username

immich:
  metrics:
    # Enabling this will create the service monitors needed to monitor immich with the prometheus operator
    enabled: false
  persistence:
    # Main data store for all photos shared between different components.
    library:
      existingClaim: immich-library-pvc-claim

# Dependencies

postgresql:
  enabled: true
  global:
    postgresql:
      auth:
        existingSecret: immich-postgresql-secret
  primary:
    persistence:
      existingClaim: immich-postgresql-pvc-claim
    containerSecurityContext:
      readOnlyRootFilesystem: false
    initdb:
      scripts:
        create-extensions.sql: |
          CREATE EXTENSION cube;
          CREATE EXTENSION earthdistance;
          CREATE EXTENSION vectors;

redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: false
  master:
    persistence:
      existingClaim: immich-redis-pvc-claim


# Immich components
server:
  enabled: true

machine-learning:
  enabled: true
  env:
    TRANSFORMERS_CACHE: /cache
  persistence:
    cache:
      enabled: true
      # Optional: Set this to pvc to avoid downloading the ML models every start.
      type: pvc
      existingClaim: immich-ml-cache-pvc-claim

I have created volumes in Longhorn with PVCs for each claim, and they all attach normally.
All pods are running except the immich-server, which keeps failing its readiness probe. If I check the pod logs I just see:

Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 7  - 10/04/2024, 10:30:08 PM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 7  - 10/04/2024, 10:30:08 PM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 17  - 10/04/2024, 10:30:08 PM     LOG [Api:EventRepository] Initialized websocket server

For the immich-postgresql-secret I have just created a generic secret in the same namespace, like this:

kubectl create secret generic immich-postgresql-secret -n immich --from-literal=postgres-password="cool-secret-password" --from-literal=password="cool-secret-password" --from-literal=username="immich" --from-literal=database="immich"
@bo0tzz
Member

bo0tzz commented Oct 5, 2024

I think the pod is just taking a long time to start. You might want to try making the readiness timeout a bit longer.

@morotsgurka
Author

Hi!

Not sure if this was what you meant, but I increased the readiness and startup probes by modifying the server values:

# Immich components
server:
  enabled: true
  probes:
    readiness:
      custom: true
      spec:
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
    startup:
      custom: true
      spec:
        initialDelaySeconds: 0
        timeoutSeconds: 5
        ## With periodSeconds=10 and failureThreshold=60, it has a maximum of 600 seconds to start up before it fails
        periodSeconds: 10
        failureThreshold: 60 

By inspecting the pod I can see that the changes are reflected on the pod:

Liveness
http-get http://:3001/api/server/ping delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness
http-get http://:3001/api/server/ping delay=30s timeout=5s period=10s #success=1 #failure=3
Startup
http-get http://:3001/api/server/ping delay=0s timeout=5s period=10s #success=1 #failure=60

But I still get:

Startup probe failed: Get "http://10.42.2.208:3001/api/server/ping": dial tcp 10.42.2.208:3001: connect: connection refused

All I see in the pod logs is still:

Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 7  - 10/05/2024, 10:09:17 AM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 7  - 10/05/2024, 10:09:17 AM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 17  - 10/05/2024, 10:09:17 AM     LOG [Api:EventRepository] Initialized websocket server

@madeofpendletonwool

I'm having a similar issue as well:

Warning  Unhealthy  <invalid> (x3 over <invalid>)  kubelet  Startup probe failed: Get "http://10.42.2.120:3001/api/server/ping": dial tcp 10.42.2.120:3001: connect: connection refused
Warning  Unhealthy  <invalid>                      kubelet  Startup probe failed: HTTP probe failed with statuscode: 404

Digging into the logs, I see:

[Nest] 17 - 10/05/2024, 2:20:02 PM LOG [Api:Bootstrap] Immich Server is listening on http://[::1]:3001 [v1.106.1] [PRODUCTION]
Error: connect ECONNREFUSED 10.43.247.217:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1606:16) {
  errno: -111,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '10.43.247.217',
  port: 6379
}
Error: connect ETIMEDOUT
    at Socket.<anonymous> (/usr/src/app/node_modules/ioredis/built/Redis.js:170:41)
    at Object.onceWrapper (node:events:633:28)
    at Socket.emit (node:events:519:28)
    at Socket._onTimeout (node:net:590:8)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

Seemingly issues with redis. Granted, perhaps I'm setting up redis wrong and this is different from your issue, but I don't think so. My redis pods are starting fine and I'm deploying redis along with the rest of the helm chart, so it should be what's provided by default.
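
If anyone wants to rule out Redis connectivity from inside the cluster, a quick hedged check (the service name immich-redis-master is an assumption based on the release name; adjust to whatever kubectl get svc shows):

# run a throwaway redis-cli pod and ping the Redis service
kubectl -n immich run redis-check --rm -it --restart=Never --image=redis:alpine -- \
  redis-cli -h immich-redis-master ping
# a healthy Redis answers: PONG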

@morotsgurka
Author

I had similar, if not the same, errors about Redis. But I think that was just initially, before the Redis pod had started? Now if I evict the immich-server pod I don't see any Redis errors.

@madeofpendletonwool

Good call @morotsgurka. I think you're right; they went away for me too. The app still isn't starting, of course, presumably due to the probes.

@AlexBevan

Same here; reverting to tag v1.116.0 is fine, but the same error occurs with v1.117.0.

@morotsgurka
Author

Same here; reverting to tag v1.116.0 is fine, but the same error occurs with v1.117.0.

Can confirm. I added an image tag of 1.116 and now it starts.

@joaopedrocg27

Same issue here:

Detected CPU Cores: 3
Starting api worker
Starting microservices worker
[Nest] 7  - 10/06/2024, 8:59:42 PM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 16  - 10/06/2024, 8:59:43 PM     LOG [Api:EventRepository] Initialized websocket server
[Nest] 7  - 10/06/2024, 8:59:51 PM     LOG [Microservices:MapRepository] Initializing metadata repository

Using Chart 0.8.1 and 1.107.0

@zlq1503

zlq1503 commented Oct 8, 2024

I have the same issue; for now I have rolled back to tag v1.116.0.

@boredomwontgetus

Same issue here! Last working tag for me is v1.116.2

@kabakaev

kabakaev commented Oct 10, 2024

The readiness probe fails because the immich-api process does not listen on TCP port 3001. I don't know why.

I have two immich instances on one kubernetes node. The old instance is running v1.116.2, whereas the new one is trying to start with v1.117.0. The new instance does not listen on TCP/3001, even though the liveness probe timeout was increased from 30 to 690 seconds to prevent the pod from being killed too early.

ps auxw | grep immich | grep -vE 'postgres|grep'
# root       52171 11.7  2.0 22835356 325564 ?     Sl   17:48   0:19 immich
# root       52227  4.7  1.0 11648044 174408 ?     Sl   17:48   0:07 immich-api
# root     2082901  0.2  1.6 22981404 268440 ?     Sl   Okt02  31:55 immich
# root     2083038  0.0  0.9 11658532 148792 ?     Sl   Okt02   8:37 immich-api

nsenter --net --target 2082901
# We are in the linux network namespace of the immich-server v1.116.2 container.
ss -apn | grep LISTEN
# tcp   LISTEN 0      511                *:8082                *:*        users:(("immich",pid=2082901,fd=24))
# tcp   LISTEN 0      511                *:8081                *:*        users:(("immich-api",pid=2083038,fd=19))
# tcp   LISTEN 0      511                *:3001                *:*        users:(("immich-api",pid=2083038,fd=39))
# tcp   LISTEN 0      511                *:33673               *:*        users:(("immich",pid=2082901,fd=72))
exit # back to host NS.
nsenter --net --target 40560
# We are in the linux network namespace of the immich-server v1.117.0 container.
ss -apn | grep LISTEN
# tcp   LISTEN 0      511                *:8081                 *:*         users:(("immich-api",pid=52227,fd=19))
# tcp   LISTEN 0      511                *:8082                 *:*         users:(("immich",pid=52171,fd=24))

The v1.117.0 logs are pretty short. Both v1.117.0 and v1.116.2 logs are given below.

Setting the env.IMMICH_IGNORE_MOUNT_CHECK_ERRORS: true helm value did not help.

Edit: setting env.IMMICH_LOG_LEVEL: debug or env.IMMICH_LOG_LEVEL: verbose does not change the log output. The same 3 messages are printed, then nothing.

Edit 2: when the container is killed, it prints out the fourth log line: [Nest] 17 - 10/10/2024, 5:42:47 PM LOG [Api:MapRepository] Initializing metadata repository. It is probably caused by the event log caching.
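
For reference, the log level was passed through the chart's env block like this (a minimal sketch mirroring the env mapping at the top of this issue):

env:
  IMMICH_LOG_LEVEL: verbose   # also tried: debug; neither changed the output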

# kubectl -n immich1 logs deploy/immich-server
Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 6  - 10/10/2024, 3:53:23 PM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 6  - 10/10/2024, 3:53:23 PM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 16  - 10/10/2024, 3:53:23 PM     LOG [Api:EventRepository] Initialized websocket server

# kubectl -n immich2 logs deploy/immich-server | less -R
Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 8  - 10/02/2024, 10:51:46 AM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 18  - 10/02/2024, 10:51:46 AM     LOG [Api:EventRepository] Initialized websocket server
[Nest] 8  - 10/02/2024, 10:51:47 AM     LOG [Microservices:SystemConfigService] LogLevel=log (set via system config)
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:SystemConfigService] LogLevel=log (set via system config)
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:ServerService] Feature Flags: {  "smartSearch": true, "facialRecognition": true, "duplicateDetection": true, "map": true, "reverseGeocoding": true, "importFaces": false, "sidecar": true, "search": true, "trash": true, "oauth": false, "oauthAutoLaunch": false, "passwordLogin": true, "configFile": false, "email": false }
[Nest] 8  - 10/02/2024, 10:51:47 AM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:StorageService] Verifying system mount folder checks (enabled=false)
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:StorageService] Writing initial mount file for the encoded-video folder
etc.

@kabakaev

Hm, something big has changed there. Previously, the MapRepository was under Microservices, but in v1.107.0 it became Api:MapRepository, even though the MapRepository code itself has not changed in 3 months.
Could this be the reason for the startup issue?

@kabakaev

There are only 34 commits changing the server. Someone has to bisect, build and deploy them :)

git log v1.106.2..v1.107.0 | grep server -B4 | grep '\--' |wc -l
34
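
A rough sketch of driving that bisect, assuming a local immich checkout and some way to build and deploy the server image at each step:

git bisect start v1.107.0 v1.106.2   # bad tag first, then the last known-good tag
# at each step: build the server image from the current commit, deploy it,
# and check whether the api starts listening on port 3001
git bisect good    # or: git bisect bad
git bisect reset   # return to the original HEAD when finished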

@cconcannon

I also experienced the same issue when trying to install immich for the first time. All pods start except immich-server, which is continuously unhealthy due to a startup probe failure:

Startup probe failed: Get "http://10.42.1.21:3001/api/server/ping": dial tcp 10.42.1.21:3001: connect: connection refused

In the logs for the server pod, there's a lot of this:

Error: connect ETIMEDOUT
    at Socket.<anonymous> (/usr/src/app/node_modules/ioredis/built/Redis.js:170:41)
    at Object.onceWrapper (node:events:633:28)
    at Socket.emit (node:events:519:28)
    at Socket._onTimeout (node:net:591:8)
    at listOnTimeout (node:internal/timers:581:17)
    at process.processTimers (node:internal/timers:519:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

I've confirmed that the same errors happen for me on v1.117.0 and v1.116.2.

@morotsgurka
Author

morotsgurka commented Oct 12, 2024

@cconcannon have you enabled the Redis pods? This sounds similar to earlier comments in this thread. I also had similar issues, but only on first boot before redis was initialized, and the issue we seem to be having is not present on 1.116.

@revolunet

revolunet commented Oct 12, 2024

Same issue here with the startup probe on helm chart 0.8.1 and immich v1.117.0. Setting image.tag to v1.116.0 fixes it.

@cconcannon

@morotsgurka yes, I enabled Redis and it starts successfully. I see the Redis, Postgres, and Machine Learning pods all start successfully, but the server continues to have errors, even after I try deleting it.

@evanreichard

I've had continuous problems with the probes and completely disabled them; that has helped in the past. But on 1.117 it hangs indefinitely for me on LOG [Api:EventRepository] Initialized websocket server and never comes online. Once I drop the tag to 1.116.2, everything works fine.
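
In case it helps, this is roughly how I disabled them (a sketch assuming the chart exposes an enabled flag per probe, alongside the custom/spec fields shown earlier in this thread):

server:
  probes:
    liveness:
      enabled: false   # assumed flag; verify against the chart's values.yaml
    readiness:
      enabled: false
    startup:
      enabled: false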

@rjbez17

rjbez17 commented Oct 14, 2024

I spent quite a while trying to fix this, to the point of taking a DB backup, deploying a standalone postgresql, reinstalling 116, and then upgrading again to 117. What I ultimately determined (for me at least) was that immich leaves the postgres resources at the default (nano: https://github.com/bitnami/charts/blob/main/bitnami/postgresql/values.yaml#L465C3-L470C26), which is quite low. After upping postgresql to large I saw more than just the three logs in immich-server and things started working. You may not need large, but I have plenty of compute to throw at it, so I did.

My suspicion is that the DB migration required for 117 (or something) takes way too long on the nano preset, and the probes cause a restart, which kicks the DB migration off again (or something).
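
For reference, the values change that worked for me looked roughly like this (preset names come from the bitnami chart linked above; "small" or "medium" may already be enough):

postgresql:
  primary:
    # bumped from the bitnami default of "nano"
    resourcesPreset: "large"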

@zlq1503

zlq1503 commented Oct 14, 2024

@rjbez17 I verified this solution. The postgres container resources were too small; after setting postgresql.primary.resourcesPreset: "large", immich-server starts normally.

@boredomwontgetus

boredomwontgetus commented Oct 14, 2024

I ran into the same problem when using a PG DB outside a container.
Edit: Forget about that, I just forgot to update the chart.

@bo0tzz
Member

bo0tzz commented Oct 14, 2024

I was pretty stumped by this issue, but in hindsight the postgres resources are the obvious culprit 🤦
With that I'll close this issue. I'm a bit hesitant to set higher default resources, as that might make the pod unschedulable for some, though leaving them this low isn't really feasible either. I'll have a think about that.
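
In the meantime, anyone affected can also override the postgres resources explicitly instead of relying on a preset; a rough sketch (the numbers are illustrative, not a recommendation):

postgresql:
  primary:
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: "2"
        memory: 2Gi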

While I have you all here: I'd love your feedback on #129.

@bo0tzz bo0tzz closed this as completed Oct 14, 2024
@kabakaev

kabakaev commented Oct 15, 2024

Setting postgresql.primary.resourcesPreset: large did not help. The database received a stream of geodata INSERT SQL requests, which consumed ca. 15% of one CPU core and around 100MB of RAM. Unfortunately, the immich-api service did not start within the default 5-minute timeout of the startupProbe.

If the startup probe never succeeds, the container is killed...

Finally, it helped to increase the startupProbe timeout to 1 hour:

server:
  probes:
    startup:
      spec:
        failureThreshold: 360
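        # with the chart's default periodSeconds of 10s, 360 failures ≈ 3600s, i.e. about 1 hour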

So, this issue should have been called Startup probe failed, probably due to slow import of geo-data.
