Frequent 504s and Poor Uptime on Docker Compose deployments #821
Comments
Services go down and come back up within a few minutes, all throughout the day. It has tanked uptime to 30%. Forgejo Docker compose:
Ghost Docker compose
Umami:
These are the docker composes for affected services |
I believe it's an issue with Traefik. I can still access the port-forwarded services: for example, Umami is forwarded to port 4999 on the host and served at stats.towu.dev via Traefik. When stats.towu.dev is down, I can still reach host:4999 to see Umami, so I'm pretty confident it's a proxy issue. Something peculiar: all the affected compose services (Ghost, Umami, and Forgejo) go down at the same time, while other compose projects, like Immich, don't go down at all. Immich is a photo-management app that includes a website as part of its Docker compose, like the other services. Immich (no downtime) Docker compose
|
Immich has downtime as well. Related documentation: https://docs.dokploy.com/docs/core/troubleshooting#docker-compose-domain-not-working

 version: '3'
 services:
   umami:
     image: ghcr.io/umami-software/umami:postgresql-latest
     ...
     expose:
       - 3000
     ports:
-      - 4999:3000
+      - 3000
     networks:
       - default
   db:
     image: postgres:15-alpine
     ...
     networks:
       - default
 networks:
   default:
     driver: bridge

I'm trying this just to check. I need the ports forwarded because, for example, I can't upload large files to Immich through the Cloudflare-proxied domain. |
I think I know what the error could be. There is currently a very rare bug related to Docker Compose: if the same service name is used in several places, the information can somehow get mixed up. I have not found a solution to this problem yet. My suggestion would be to change the name of the service from

services:
  db:
    .....

to something like this:

services:
  ghost-db:
    ..... |
I've updated my services to use prefixed names, I guess that's what the randomize compose is for. Is there anything I can do to provide some more insight? Traefik logs, if you lmk how I can get em. (docker logs would be enough?) Likely related, umami-software/umami#3080 (reply in thread) - I believe another service was attempting to access Umami's database, leading to that error. |
Oh, is it because all the containers are part of the |
@Siumauricio I updated the services to have unique names and rebuilt the project
|
Does the problem still persist? |
I'm experiencing very similar issues. I'm also using the cloud-hosted version of Dokploy instead of self-hosting, because I thought self-hosting might have been the cause. After doing some digging, it's definitely the reverse proxy. |
My server was down 5 minutes ago. I am not monitoring it, but I assume this is still an issue. |
@Siumauricio This issue is causing me a lot of trouble, is there anything I can do to help? |
Yes, I definitely think it is a bug in Docker at the network level. We must find a solution, because currently we cannot run 2 instances of the same template: sometimes the information gets mixed between them, which is very strange behavior. I will investigate in more detail how to solve this. The idea would be to isolate each Docker compose in a separate network. |
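A minimal sketch of what isolating a compose project in its own network could look like. This is not Dokploy's implementation; the service names, image tags, and the `ghost-internal` network are hypothetical, and `dokploy-network` is assumed to already exist on the host.

```yaml
# Hypothetical compose file: names and image tags are illustrative only.
services:
  ghost:
    image: ghost:5-alpine
    networks:
      - dokploy-network   # shared with Traefik so the proxy can reach the web container
      - ghost-internal    # private link to this project's database
  ghost-db:
    image: mysql:8
    networks:
      - ghost-internal    # the database never joins the shared network

networks:
  dokploy-network:
    external: true        # assumed to be created outside this compose file
  ghost-internal:
    driver: bridge
```

With this layout, a database in one project is not resolvable from another project's containers over the shared network. Note that the web container is now attached to two networks, which is the situation discussed later in this thread where Traefik needs to be told which network to use.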
@Siumauricio I tried the fix in #1004 (randomize compose names) and the uptime hasn't improved at all. This issue is urgent and affecting my users. Broken networking is a dealbreaker, is there anything else I can try? Last ditch effort would be disabling Traefik and using a reverse proxy on host networking, or moving to another platform - which is a huge effort. Are there any blockers for this issue? Any logs or information you need? Anything? |
I'm having similar problems, and randomizing compose names didn't fix it for me either. I also suspect it has something to do with the same internal port being published from similar services/containers on the same

Feel free to ping me as well if I can provide any logs, information, or test something helpful for this issue.

I have 2 different Dokploy projects on one server, each containing 2 docker compose services. For example, one docker compose is `nextcloud + mariadb + redis`. I get this problem even though the Nextcloud webserver images have different Docker image tags/versions in the two projects. Whenever I deploy the service from the second project, the container of the first project is no longer reachable, and Traefik shows the error page "404 page not found". I also defined a custom-named network for each service (docker compose), so the database and webserver in a single docker compose can communicate:
There are also many other services running on my single server that work fine and seem unaffected by the aforementioned problematic deployments. |
FYI: Our current workaround is to set the port that the application/webserver itself listens on inside the container to something different for each service. So the similar webservers that would normally all listen on port 80 now listen on 81, 82, 83, ... |
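A rough sketch of how that workaround might look on the Traefik side, assuming routing is configured through container labels. The service names, hostnames, and router names are hypothetical, and how each webserver is made to listen on the non-default port depends on the image and is not shown here.

```yaml
# Hypothetical compose fragment: names, hosts, and ports are illustrative only.
services:
  cloud-one:
    image: nextcloud
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.cloud-one.rule=Host(`one.example.com`)"
      # Assumes the webserver in this container was reconfigured to listen on 81.
      - "traefik.http.services.cloud-one.loadbalancer.server.port=81"
  cloud-two:
    image: nextcloud
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.cloud-two.rule=Host(`two.example.com`)"
      # Assumes the webserver in this container was reconfigured to listen on 82.
      - "traefik.http.services.cloud-two.loadbalancer.server.port=82"
```

The label only tells Traefik which container port to forward to; the application must actually be listening on that port for the route to work.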
Something else I noticed: whenever the services are unreachable, I'm unable to view logs from the Dokploy dashboard; the log view is just empty. The logs load when the service is available via the domain, which is weird, because the service is reachable through port mappings regardless. |
Go to the Traefik file system in your Dokploy dashboard and make this change in traefik.yml:

providers:
  swarm:
    exposedByDefault: false
    watch: true
  docker:
    exposedByDefault: false
    watch: true
    network: dokploy-network

The error comes from your networks: you created 2 networks and the authelia container is assigned to both of them. Traefik, while forwarding, doesn't know which network to use. So you have to specify it in your docker provider configuration: |
@Siumauricio you can close this. |
Thanks! @theboringhumane . Have to observe it a bit more to be sure, but I suppose it's working now on my end. I switched all my services to type
|
Hey @theboringhumane, thanks for the solution! I'll try it and update the issue.
I have a few questions,
➜ ~ docker inspect dokploy-traefik.1.s2o77zzkq0hsqi8x7p837a8w8
[
  {
    ...
    "NetworkSettings": {
      "Bridge": "",
      "SandboxID": "a6e4d65f981460a574c39da307579f5e181301a20f21505d392970a6429a6073",
      "SandboxKey": "/var/run/docker/netns/a6e4d65f9814",
      "Ports": {
        "443/tcp": [
          {
            "HostIp": "0.0.0.0",
            "HostPort": "443"
          },
          {
            "HostIp": "::",
            "HostPort": "443"
          }
        ],
        "80/tcp": [
          {
            "HostIp": "0.0.0.0",
            "HostPort": "80"
          },
          {
            "HostIp": "::",
            "HostPort": "80"
          }
        ]
      },
      "Networks": {
        "dokploy-network": {
          ...
        }
      }
    }
  }
]

That doesn't look like the case. The Traefik container knows the service container through a single network. It would be nice to know why the issue showed up as intermittent connectivity instead of the container being unreachable from the get-go.
Let's hold that off until our uptime recovers. Keeping in mind that this issue is intermittent, a few hours of uptime is normal behaviour without this fix. |
No, your service container is in 2 networks, so Traefik is confused about which network to route traffic to. If we specify the network, it'll work smoothly. |
My services are currently up! Thanks a ton @theboringhumane I'll close the issue in 24h if it doesn't go down again. |
Happy to see it worked for you! |
I followed @theboringhumane's solution as-is, #821 (comment), replacing the first few lines of my traefik.yml. The fix works flawlessly. I don't understand the mechanism behind it, but it does the trick! This works without adding a random prefix to service names or converting to stack. Thanks for the fix! |
Because now Traefik knows which network to route the traffic to. In compose, if you have a network defined other than dokploy-network, you have to let Traefik know which network is going to serve the HTTP requests. Otherwise Traefik will keep waiting and you'll see a 504. |
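As an illustration of the same idea at the level of a single service, Traefik's Docker provider also accepts a `traefik.docker.network` label on the container. The fragment below is a sketch under assumptions, not the configuration used in this thread: the service name, image, and `app-internal` network are hypothetical, and `dokploy-network` is assumed to be the network Traefik is attached to.

```yaml
# Hypothetical compose fragment: service, image, and network names are illustrative only.
services:
  web:
    image: nginx:alpine
    networks:
      - dokploy-network   # network shared with Traefik
      - app-internal      # second network, e.g. for a database
    labels:
      - "traefik.enable=true"
      # Tell Traefik which of this container's networks to use when forwarding requests.
      - "traefik.docker.network=dokploy-network"

networks:
  dokploy-network:
    external: true        # assumed to already exist on the host
  app-internal:
    driver: bridge
```

Setting `network: dokploy-network` on the provider in traefik.yml applies the same choice as a default for every container, which is why the fix above did not require touching individual compose files.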
Thanks for the solution! I am grateful for this tool, but it should take less than 3 days to set it up and debug it... trying different solutions, different domains, different servers... I am lucky I found this issue and solution. This should not be a closed bug; there is not a single word about this problem in the documentation... |
Fix: Thanks @theboringhumane! Update your traefik.yml with #821 (comment)
To Reproduce
This isn't an issue with UptimeKuma, because there are long periods of inactivity on my statistics as well.
Uptime stats with large blocks of empty:
Before moving composes to Dokploy
During a "downtime",
Current vs. Expected behavior
Provide environment information
Which area(s) are affected? (Select all that apply)
Docker Compose
Are you deploying the applications where Dokploy is installed or on a remote server?
Same server where Dokploy is installed
Additional context
This doesn't happen when deploying on the host system without Dokploy, circumventing traefik.
Will you send a PR to fix it?
Maybe, need help