Frequent 504s and Poor Uptime on Docker Compose deployments #821

Closed
TheOnlyWayUp opened this issue Dec 6, 2024 · 28 comments
Labels
bug Something isn't working

Comments

@TheOnlyWayUp

TheOnlyWayUp commented Dec 6, 2024

Fix: Thanks @theboringhumane! Update your traefik.yml with #821 (comment)

To Reproduce

  1. Create multiple docker compose services in the same project
  2. Monitor their uptime (for example with Uptime Kuma)

image
image
This isn't an issue with Uptime Kuma, because my statistics show long periods of inactivity as well.

Uptime stats with large empty blocks:
image

Before moving composes to Dokploy
image


During a "downtime",

  • Nothing on service logs
  • Traefik request logs show 504s and timeouts

image

Current vs. Expected behavior

  • Services are supposed to be online until turned off
  • Current: Services are online on Dokploy's console but unreachable by the network intermittently

Provide environment information

CPU: AMD Ryzen 7 3700X (16) @ 3.600
GPU: 2b:00.0 ASPEED Technology, Inc
Memory: 13894MiB / 64221MiB
OS: Ubuntu 24.04 LTS x86_64
Host: 1.0
Kernel: 6.8.0-49-generic
Dokploy Version: v0.12.0

Which area(s) are affected? (Select all that apply)

Docker Compose

Are you deploying the applications where Dokploy is installed or on a remote server?

Same server where Dokploy is installed

Additional context

This doesn't happen when deploying on the host system without Dokploy, circumventing traefik.

Will you send a PR to fix it?

Maybe, need help

@TheOnlyWayUp TheOnlyWayUp added the bug Something isn't working label Dec 6, 2024
@TheOnlyWayUp
Author

Services go down and come back up within a few minutes, all throughout the day. It has tanked uptime to 30%.

Forgejo Docker compose:

version: "3"

services:
  server:
    image: codeberg.org/forgejo/forgejo:8
    container_name: forgejo
    environment:
      - USER_UID=1000
      - USER_GID=1000
    restart: always
    networks:
      - default
    volumes:
      - /root/Projects/Forge/forgejo:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - 5005:5005
      - 222:22
    expose:
      - 5005
networks:
  default:

Ghost Docker compose

version: '3.1'

services:
  ghost:
    image: ghost:5-alpine
    restart: always
    expose:
      - 2368
    networks:
      - default
    environment:
      # see https://ghost.org/docs/config/#configuration-options
      database__client: mysql
      database__connection__host: db
      database__connection__user: root
      database__connection__password: 
      database__connection__database: ghost
      # this url value is just an example, and is likely wrong for your environment!
      url: https://blog.rambhat.la
      # contrary to the default mentioned in the linked documentation, this image defaults to NODE_ENV=production (so development mode needs to be explicitly specified if desired)
      #NODE_ENV: development
    labels:
    - "traefik.enable=true"

    # Middleware for replacing content in the body
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].regex=</head>"
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].replacement=<script defer src='https://stats.towu.dev/script.js' data-website-id='4d72a7bf-3049-4c82-8ff4-05c0bc4f8edf'></script></head>"

    # Link the middleware to the router
    - "traefik.http.routers.blog.middlewares=inject-script"

    volumes:
      - ghost_ghost:/var/lib/ghost/content
    depends_on:
      - db

  db:
    image: mysql:8.0
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: 
    expose:
      - 3306
    volumes:
      - ghost_db:/var/lib/mysql
        #
volumes:
  ghost_ghost:
    external: true
  ghost_db:
    external: true

networks:
  default:

Umami:

version: '3'
services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    environment:
      DATABASE_URL: postgresql://umami:umami@db:5432/umami
      DATABASE_TYPE: postgresql
      APP_SECRET: 
    depends_on:
      db:
        condition: service_healthy
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'curl http://localhost:3000/api/heartbeat']
      interval: 5s
      timeout: 5s
      retries: 5
    expose:
      - 3000
    ports:
      - 4999:3000
    networks:
      - default
      
  db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: umami
      POSTGRES_USER: umami
      POSTGRES_PASSWORD: umami
    expose:
      - 5432
    volumes:
      - /root/Projects/miami/var/lib/postgresql/data:/var/lib/postgresql/data
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - default

networks:
  default:
    driver: bridge 

These are the Docker Compose files for the affected services.

@TheOnlyWayUp
Author

I believe it's an issue with Traefik, I can access the port-forwarded services (for example, Umami is forwarded to 4999 on the host and stats.towu.dev via traefik).

When stats.towu.dev is down, I can still access host:4999 to see Umami, so I'm pretty confident it's a proxy issue.

Something peculiar: all the affected compose services (Ghost, Umami, and Forgejo) go down at the same time, while other compose projects, like Immich, don't go down at all. Immich is a photo-management app whose Docker Compose file includes a website, like the other services.

Immich (no downtime) Docker Compose

version: "3"
name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    networks:
      - default
    extends:
      file: ../../hwaccel.transcoding.yml
      service: cpu # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    volumes:
      # Do not edit the next line. If you want to change the media storage location on your system, edit the value of UPLOAD_LOCATION in the .env file
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
      - /root/Projects/Immich/external:/mnt/media:ro
    env_file:
      - .env
    ports:
      - xxxx:2283
    expose:
      - 2283
    depends_on:
      - redis
      - database
    restart: always
    healthcheck:
      disable: false

  immich-machine-learning:
    container_name: immich_machine_learning
    networks:
      - default
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
    #   file: hwaccel.ml.yml
    #   service: cpu # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always
    healthcheck:
      disable: false

  redis:
    container_name: immich_redis
    networks:
      - default
    image: docker.io/redis:6.2-alpine@sha256:e3b17ba9479deec4b7d1eeec1548a253acc5374d68d3b27937fcfe4df8d18c7e
    healthcheck:
      test: redis-cli ping || exit 1
    restart: always

  database:
    container_name: immich_postgres
    networks:
      - default
    image: docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      POSTGRES_INITDB_ARGS: '--data-checksums'
    volumes:
      # Do not edit the next line. If you want to change the database storage location on your system, edit the value of DB_DATA_LOCATION in the .env file
      - ${DB_DATA_LOCATION}:/var/lib/postgresql/data
    command: ["postgres", "-c", "shared_preload_libraries=vectors.so", "-c", 'search_path="$$user", public, vectors', "-c", "logging_collector=on", "-c", "max_wal_size=2GB", "-c", "shared_buffers=512MB", "-c", "wal_compression=on"]
    restart: always

volumes:
  model-cache:

networks:
  default:

@TheOnlyWayUp
Author

Immich has downtime as well.

Related: #656 #734 #752

Related documentation, https://docs.dokploy.com/docs/core/troubleshooting#docker-compose-domain-not-working

version: '3'
services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    ...
    expose:
      - 3000
    ports:
-      - 4999:3000
+     - 3000
    networks:
      - default
      
  db:
    image: postgres:15-alpine
    ...
    networks:
      - default

networks:
  default:
    driver: bridge 

I'm trying this just to check. I need the ports forwarded because, for example, I can't upload large files to Immich through the Cloudflare-proxied domain.
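
For reference, the two `ports` forms in the diff above behave differently in Compose. A minimal sketch, reusing the service name and image from the compose above; the comments describe standard Compose behaviour and assume nothing Dokploy-specific:

```yaml
services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    expose:
      - 3000        # documents the container port; nothing is published on the host
    ports:
      - 4999:3000   # HOST:CONTAINER, binds host port 4999 to container port 3000
      # - 3000      # container port only: Docker publishes it on an ephemeral host port
```

With the container-port-only form, `docker compose port umami 3000` prints the host port that Docker picked.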

@TheOnlyWayUp
Author

image
The ghost service goes down often (not from Dokploy's template), and has no ports forwarded.

version: '3.1'

services:

  ghost:
    image: ghost:5-alpine
    expose:
      - 2368
    networks:
      - default
    ...
    labels:
    - "traefik.enable=true"

    # Middleware for replacing content in the body
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].regex=</head>"
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].replacement=<script defer src='https://stats.towu.dev/script.js' data-website-id='4d72a7bf-3049-4c82-8ff4-05c0bc4f8edf'></script></head>"

    # Link the middleware to the router
    - "traefik.http.routers.blog.middlewares=inject-script"

    depends_on:
      - db

  db:
    image: mysql:8.0
    ...
    expose:
      - 3306

volumes:
  ghost_ghost:
    external: true
  ghost_db:
    external: true

networks:
  default:

I'll keep an eye on the uptime

@Siumauricio
Contributor

I know what the error could be. There is currently a very rare bug related to Docker Compose: if you use the same service name in several places, the information can get mixed up somehow. I have not found a solution to this problem yet, so my suggestion would be to change the name of the service

services:
      db:
          .....

to something like this

services:
      ghost-db:
          .....

@TheOnlyWayUp
Author

I've updated my services to use prefixed names. I guess that's what the randomize compose option is for.

Is there anything I can do to provide some more insight? Traefik logs, if you lmk how I can get em. (docker logs would be enough?)
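
One way to get more detail out of Traefik is to raise its log level and turn on access logs in the static configuration. A minimal sketch, assuming the `traefik.yml` that can be edited from the Dokploy dashboard is Traefik's static configuration file and that these keys are not already set there:

```yaml
# traefik.yml (static configuration); merge these keys into the existing file
log:
  level: DEBUG   # verbose, so revert once the investigation is done

accessLog: {}    # enable access logs with default settings
```

With no `filePath` set, both logs go to the container's standard output, so `docker logs` on the Traefik container should show them once Traefik has been restarted.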

Likely related, umami-software/umami#3080 (reply in thread) - I believe another service was attempting to access Umami's database, leading to that error.

@TheOnlyWayUp
Author

Oh, is it because all the containers are part of the dokploy-network network, and names are resolved over this network instead of default? Dokploy also removes the default network unless it's explicitly included in the compose.
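
To illustrate the hypothesis: Compose registers each service name as a DNS alias on every network the service joins, so if two projects both attach a service called `db` to one shared network, that name can resolve to either project's database. A hedged sketch of keeping the database off the shared network follows. The network name `umami-internal` is hypothetical, and the sketch assumes Dokploy deploys the network list as written rather than attaching every service to `dokploy-network` itself:

```yaml
services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    networks:
      - umami-internal    # private to this compose project
      - dokploy-network   # shared with Traefik

  umami-db:
    image: postgres:15-alpine
    networks:
      - umami-internal    # only resolvable from inside this project

networks:
  umami-internal:         # hypothetical name, chosen for illustration
    driver: bridge
  dokploy-network:
    external: true        # created outside this compose file
```

In this layout the `umami` container sits on two networks, so the reverse proxy has to know which of them to use when routing to it.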

@TheOnlyWayUp
Author

@Siumauricio I updated the services to have unique names and rebuilt the project
image
Still having uptime issues, this is my updated compose

version: '3'
services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    environment:
      DATABASE_URL: postgresql://umami:xxx@umami_db:5432/umami
    depends_on:
      umami_db:
        condition: service_healthy
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'curl http://localhost:3000/api/heartbeat']
      interval: 5s
      timeout: 5s
      retries: 5
    expose:
      - 3000
    ports:
      - 3000
    networks:
      - default
      
  umami_db:
    image: postgres:15-alpine
    expose:
      - 5432
    volumes:
      - ...:/var/lib/postgresql/data
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - default

networks:
  default:
    driver: bridge 

@Siumauricio
Contributor

the problem still persists?

@TheOnlyWayUp
Author

Yep,
image

@kamellperry

Experiencing very similar issues. I'm also using the cloud-hosted version of Dokploy instead of self-hosted, because I thought self-hosting might have been the cause. After doing some digging, it's definitely the reverse proxy.

@2shrestha22

My server was down 5 minutes ago. I am not monitoring it, but I assume this is still an issue.

@TheOnlyWayUp
Author

@Siumauricio This issue is causing me a lot of trouble, is there anything I can do to help?

@Siumauricio
Contributor

Yes, I definitely think it is a bug in Docker at the network level. I think we must find a solution, because currently we cannot have two instances of the same template: sometimes it causes the information to be mixed, which is very strange behavior. I will investigate in more detail how to solve this. The idea would be to isolate each docker compose in a separate network.

@TheOnlyWayUp
Author

@Siumauricio I tried the fix in #1004 (randomize compose names) and the uptime hasn't improved at all.

This issue is urgent and affecting my users. Broken networking is a dealbreaker. Is there anything else I can try?

A last-ditch effort would be disabling Traefik and using a reverse proxy on host networking, or moving to another platform, which is a huge effort.

Are there any blockers for this issue? Any logs or information you need? Anything?

@dreiekk

dreiekk commented Jan 4, 2025

I'm having similar problems. Randomizing compose names didn't fix it for me either.

I also suspect it has something to do with the same internal port being published by similar services/containers on the same dokploy-network, or with the Traefik config getting broken because of that shared internal port even though the containers belong to different services.

Feel free to ping me as well if I can provide any logs, information or test something helpful to this issue.


I have 2 different dokploy projects on one server - each containing 2 docker compose services.
For example one docker compose is `nextcloud + mariadb + redis`.

I get this problem despite the nextcloud webserver images having different docker image tags/versions in both projects.

Whenever I deploy the service from the second project, the container of the first project is no longer reachable and Traefik shows the error page "404 page not found".
When I then deploy the second docker compose, it starts working, but the first one gets a 504 Gateway Timeout.
I have to stop the second project and deploy the first one again to get the first one working again.

I also defined a custom-named network for each service (docker compose), so the database and webserver in a single docker compose can communicate:

services:
  aaa-nextcloud-app:
    ...
    networks:
      - aaa-nextcloud-network
  aaa-nextcloud-db:
    ...
    networks:
      - aaa-nextcloud-network

networks:
  aaa-nextcloud-network:
    driver: bridge 

There are also many other services running on my single server which are working fine and do not seem to be affected by the aforementioned problematic deployments.

@dreiekk

dreiekk commented Jan 4, 2025

FYI: Our current workaround is to set the port of the application/webserver itself inside the container to something different for each service. So the similar webservers which would normally all listen on port 80 now listen on 81, 82, 83, ...

@TheOnlyWayUp
Author

Something else I noticed: whenever the services are unreachable, I'm unable to view logs from the Dokploy dashboard; the log view is just empty.

The logs load when the service is available via the domain, which is weird, because the service is reachable through port mappings regardless.

@theboringhumane

theboringhumane commented Jan 10, 2025

Go to traefik file system in your dokploy dashboard and do this

traefik.yml

providers:
  swarm:
    exposedByDefault: false
    watch: true
  docker:
    exposedByDefault: false
    watch: true
    network: dokploy-network

@TheOnlyWayUp

The error comes from your networks, you created 2 networks and the authelia container is assigned to both of them. Traefik, while forwarding, doesn't know which network to use. So you have to specify it in your docker provider configuration:

@theboringhumane

@Siumauricio you can close this.

@dreiekk

dreiekk commented Jan 10, 2025

Thanks, @theboringhumane!

I have to observe it a bit more to be sure, but I suppose it's working now on my end.

I switched all my services to type stack instead of compose, added @theboringhumane's traefik options, and configured the networks inside my docker-compose.yml files like this:

networks:
  my_compose_network_123:
    driver: overlay
    name: my_compose_network_123
    attachable: true

@TheOnlyWayUp
Author

TheOnlyWayUp commented Jan 10, 2025

Hey @theboringhumane, thanks for the solution! I'll try it and update the issue.

The error comes from your networks, you created 2 networks and the authelia container is assigned to both of them. Traefik, while forwarding, doesn't know which network to use. So you have to specify it in your docker provider configuration:

I have a few questions,

  1. What Authelia container?
    image

  2. What networks could be causing the conflict? The services are connected to the compose's default network (occasionally that's a bridge network, but the problem persists regardless) and to dokploy's network.

  3. Are you saying the traefik container is being added to multiple networks?

~ docker inspect dokploy-traefik.1.s2o77zzkq0hsqi8x7p837a8w8
[
    {
        ...
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "a6e4d65f981460a574c39da307579f5e181301a20f21505d392970a6429a6073",
            "SandboxKey": "/var/run/docker/netns/a6e4d65f9814",
            "Ports": {
                "443/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "443"
                    },
                    {
                        "HostIp": "::",
                        "HostPort": "443"
                    }
                ],
                "80/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "80"
                    },
                    {
                        "HostIp": "::",
                        "HostPort": "80"
                    }
                ]
            },
            "Networks": {
                "dokploy-network": {
                   ...
                }
            }
        }
    }
]

That doesn't look like the case.


The traefik container knows the service container through a single network. Would be nice to know why the issue was intermittent connectivity instead of the container being unreachable from the get-go.

you can close this.

Let's hold off on that until our uptime recovers. Keep in mind that this issue is intermittent; a few hours of uptime is normal behaviour even without this fix.

@theboringhumane

No, your service container is in 2 networks, so Traefik is confused about which network to redirect traffic to. If we specify the network, it'll work smoothly.

@TheOnlyWayUp
Author

My services are currently up! Thanks a ton @theboringhumane

I'll close the issue in 24h if it doesn't go down again.

@theboringhumane

My services are currently up! Thanks a ton @theboringhumane

I'll close the issue in 24h if it doesn't go down again.

Happy to see it worked for you!

@TheOnlyWayUp
Author

I followed @theboringhumane's solution as-is, #821 (comment), replacing the first few lines of my traefik.yml

The fix works flawlessly, I don't understand the mechanism behind it - but it does the trick!

This works without adding a random prefix to service names or converting to stack.

Thanks for the fix!

@theboringhumane

I followed @theboringhumane's solution as-is, #821 (comment), replacing the first few lines of my traefik.yml

The fix works flawlessly, I don't understand the mechanism behind it - but it does the trick!

This works without adding a random prefix to service names or converting to stack.

Thanks for the fix!

Because now Traefik knows which network to send the traffic to. In compose, if you have a network defined other than the dokploy network, you have to let Traefik know which one is going to serve the HTTP requests. Otherwise Traefik will keep waiting and you'll see a 504.
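
The same choice can also be made per service instead of globally. A minimal sketch, assuming the Traefik Docker provider is in use and that the service is attached to both its project network and `dokploy-network`; the Ghost service from earlier in the thread is reused purely for illustration:

```yaml
services:
  ghost:
    image: ghost:5-alpine
    expose:
      - 2368
    networks:
      - default           # project-internal network (reaches the database)
      - dokploy-network   # network shared with Traefik
    labels:
      - "traefik.enable=true"
      # Tell Traefik which of this container's networks to use when routing to it
      - "traefik.docker.network=dokploy-network"

networks:
  default:
  dokploy-network:
    external: true        # created outside this compose file
```

The provider-level `network: dokploy-network` setting acts as the default for every container, while the `traefik.docker.network` label overrides it for a single one.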

@Viktor-as

Go to traefik file system in your dokploy dashboard and do this

traefik.yml

providers:
  swarm:
    exposedByDefault: false
    watch: true
  docker:
    exposedByDefault: false
    watch: true
    network: dokploy-network
@TheOnlyWayUp

The error comes from your networks, you created 2 networks and the authelia container is assigned to both of them. Traefik, while forwarding, doesn't know which network to use. So you have to specify it in your docker provider configuration:

Thanks for the solution!
I had the same problem: the Cloudflare domains work, then stop, then randomly start working again.
After you make these config changes, don't forget to restart the server (in the UI, go to Server Settings > Server > Reload).
For now it looks like it's working fine. I hope it won't randomly stop again.

I am grateful for this tool, but it should take less than 3 days to set up and debug: trying different solutions, different domains, different servers... I am lucky I found this issue and its solution. This should not be a closed bug; there is not a single word about this problem in the documentation.

7 participants