3.5.7 Server keeps restarting, panicking #13154
Comments
I was checking it, but there is a different panic message.
Sorry, yes, it is a different panic. I'd be surprised if the root cause wasn't the same; the sqlite code is dealing with corrupted data.
Yup, I guess it will be related.
@jiachengxu - tagging you to make sure you've seen this. If you're working on it, maybe this can help you reproduce.
I can reproduce this with just putting enough workflows into a simple k3d single node cluster (started around 200 copies of |
This stack trace implies we have a memory corruption problem in the server. Produced in the same way, using
|
[zombiezen/go-sqlite](https://github.com/zombiezen/go-sqlite/blob/main/doc.go#L32) is not thread safe when used through a single connection. The current code is provably racing (run the server with `-race` and a few workflows being run) and it will tell you this if you `argo list` via the server a few times.

This change doesn't attempt to move to a multiple connection model, it's a minimal change to stop the server crashing all the time, by mutexing the use of the sql connection.

Fixes argoproj#13154 and argoproj#13140

Signed-off-by: Alan Clucas <[email protected]>
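The mutexing approach described in the commit message can be sketched roughly as follows. This is not the actual patch: `unsafeConn` is a hypothetical stand-in for a single connection that must not be shared between goroutines (a plain Go map, which is also unsafe for concurrent writes), and `guardedConn`, `with`, and `fillConcurrently` are illustrative names only.

```go
package main

import (
	"fmt"
	"sync"
)

// unsafeConn is a hypothetical stand-in for one database connection that
// must not be used from more than one goroutine at a time.
type unsafeConn struct {
	rows map[string]string
}

func (c *unsafeConn) put(key, value string) { c.rows[key] = value }
func (c *unsafeConn) count() int            { return len(c.rows) }

// guardedConn serializes every use of the wrapped connection with a mutex.
type guardedConn struct {
	mu   sync.Mutex
	conn *unsafeConn
}

func newGuardedConn() *guardedConn {
	return &guardedConn{conn: &unsafeConn{rows: make(map[string]string)}}
}

// with runs fn while holding the lock, so only one goroutine at a time
// ever touches the underlying connection.
func (g *guardedConn) with(fn func(c *unsafeConn)) {
	g.mu.Lock()
	defer g.mu.Unlock()
	fn(g.conn)
}

// fillConcurrently writes n distinct keys from n goroutines and returns
// the number of rows stored afterwards.
func fillConcurrently(g *guardedConn, n int) int {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			g.with(func(c *unsafeConn) {
				c.put(fmt.Sprintf("wf-%d", i), "Succeeded")
			})
		}(i)
	}
	wg.Wait()

	total := 0
	g.with(func(c *unsafeConn) { total = c.count() })
	return total
}

func main() {
	g := newGuardedConn()
	fmt.Println(fillConcurrently(g, 200))
}
```

Removing the lock in `with` and running this with `go run -race` should make the race detector report the concurrent map writes. The trade-off is the one the commit message names: all queries are serialized through one connection, which favors correctness over throughput.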
Pre-requisites
`:latest` image tag (i.e. `quay.io/argoproj/workflow-controller:latest`) and can confirm the issue still exists on `:latest`. If not, I have explained why, in detail, in my description below.
What happened/what did you expect to happen?
We have several hundred workflows in our environment. While listing workflows at 20 req/s to check memory utilization, I am getting container restarts with a panic in the argo-server pod. Prior to this, I see slow query warnings.
argo-trace.zip
Version
v3.5.7
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Any simple workflow: create 1000 workflows and try to list them at 20 req/s, e.g. with a Firefox tab reloader.
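As an alternative to a browser tab reloader, the listing load could be generated with a small program along these lines. This is a sketch under assumptions: the list URL is passed as the first argument (for Argo Server this would presumably be its workflow list endpoint for your namespace), authentication headers are omitted, and `fire` and `errBadStatus` are hypothetical names introduced here.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"os"
	"sync"
	"time"
)

// errBadStatus is returned when the server answers with a non-200 status.
var errBadStatus = errors.New("non-200 response")

// fire starts `total` calls to fetch, launching perSecond of them each
// second, and returns how many calls failed. Calls run concurrently, so
// slow responses overlap just as repeated browser reloads would.
func fire(fetch func() error, perSecond, total int) int {
	ticker := time.NewTicker(time.Second / time.Duration(perSecond))
	defer ticker.Stop()

	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for i := 0; i < total; i++ {
		<-ticker.C
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := fetch(); err != nil {
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return failed
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: loadgen <list-url>")
		return
	}
	url := os.Args[1] // supplied by the caller; not hard-coded here
	client := &http.Client{Timeout: 30 * time.Second}

	failed := fire(func() error {
		resp, err := client.Get(url)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return errBadStatus
		}
		return nil
	}, 20, 200)

	fmt.Printf("%d of 200 requests failed\n", failed)
}
```

A rising failure count while the argo-server pod restarts would line up with the behaviour reported above.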
Logs from the workflow controller
Logs from in your workflow's wait container