3.5.7 Server keeps restarting, panicking #13154

Closed
3 of 4 tasks
p53 opened this issue Jun 7, 2024 · 7 comments · Fixed by #13166
Assignees
Labels
area/api Argo Server API area/server P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority type/bug type/regression Regression from previous behavior (a specific type of bug)

Comments

@p53

p53 commented Jun 7, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

We have several hundred workflows in our environment. While listing workflows at about 20 req/s to check memory utilization, the argo-server pod's container keeps restarting with a panic. Before the panic I see slow query warnings.
argo-trace.zip

Version

v3.5.7

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Any simple workflow will do: create 1000 workflows and then list them at about 20 req/s, e.g. with a Firefox tab reloader.
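For anyone who would rather script the load than use a browser reloader, here is a minimal Go sketch of a fixed-rate request loop. It is not from the original report. The helper names `fireRequests` and `runAgainstStub` are hypothetical, the stub server only stands in for argo-server, and the `/api/v1/workflows/argo` path is an assumption about the list endpoint that should be checked against your own deployment (which will also need authentication and TLS settings).

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
	"time"
)

// fireRequests starts one GET request per tick until total requests have
// been sent, then waits for all of them and returns how many got HTTP 200.
func fireRequests(client *http.Client, url string, total int, interval time.Duration) int {
	var ok int64
	var wg sync.WaitGroup
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for i := 0; i < total; i++ {
		<-ticker.C // pace the requests; do not wait for earlier responses
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := client.Get(url)
			if err != nil {
				return
			}
			defer resp.Body.Close()
			_, _ = io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
			if resp.StatusCode == http.StatusOK {
				atomic.AddInt64(&ok, 1)
			}
		}()
	}
	wg.Wait()
	return int(ok)
}

// runAgainstStub exercises fireRequests against a local stub server.
// Replace the stub URL with a real argo-server address to reproduce the load.
func runAgainstStub(total, perSecond int) int {
	if perSecond <= 0 {
		perSecond = 1
	}
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"items":[]}`) // placeholder body, not a real Argo response
	}))
	defer srv.Close()

	// Assumed list path; verify it for your Argo Server version.
	url := srv.URL + "/api/v1/workflows/argo"
	return fireRequests(srv.Client(), url, total, time.Second/time.Duration(perSecond))
}

func main() {
	// 40 requests at 20 req/s, i.e. about two seconds of load.
	ok := runAgainstStub(40, 20)
	fmt.Printf("%d of 40 requests succeeded\n", ok)
}
```

The requests are issued from separate goroutines so that a slow response does not lower the request rate, which matters here because the report mentions slow queries before the panic.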

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@p53 p53 added the type/bug label Jun 7, 2024
@Joibel
Member

Joibel commented Jun 7, 2024

The crash that I can see is a duplicate of #13140.
This issue does give a good way of reproducing this problem though, so thank you.

Slow queries are not discussed in #13140.

@p53
Author

p53 commented Jun 7, 2024

I checked it, but the panic message there is different.

@Joibel
Member

Joibel commented Jun 7, 2024

Sorry, yes, it is a different panic. I'd be surprised if the root cause wasn't the same, though: in both cases the sqlite code is dealing with corrupted data.

@p53
Author

p53 commented Jun 7, 2024

Yup, I guess it's related.

@Joibel
Member

Joibel commented Jun 7, 2024

@jiachengxu - tagging you to make sure you've seen this. If you're working on it maybe this can help you reproduce.

@agilgur5 agilgur5 added the type/regression Regression from previous behavior (a specific type of bug) label Jun 7, 2024
@agilgur5 agilgur5 changed the title 3.5.7 containers keep restarting, panicking 3.5.7 Server keepa restarting, panicking Jun 7, 2024
@agilgur5 agilgur5 added area/api Argo Server API area/server labels Jun 7, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Jun 7, 2024
@agilgur5 agilgur5 changed the title 3.5.7 Server keepa restarting, panicking 3.5.7 Server keeps restarting, panicking Jun 8, 2024
@Joibel
Member

Joibel commented Jun 11, 2024

I can reproduce this just by putting enough workflows into a simple single-node k3d cluster (I started around 200 copies of examples/dag-diamond.yaml) and calling argo list. Occasionally that crashes in sqlite.

@Joibel
Member

Joibel commented Jun 11, 2024

This stack trace implies we have a memory corruption problem in the server. It was produced in the same way, using argo list with many dag-diamond.yaml workflows (some still running).

net.(*conn).Read(0xc0007b81e8, {0xc0009f4b00?, 0xc001501740?, 0xc002aecc38?})                                                                                                                                                                  
    /usr/local/go/src/net/net.go:179 +0x45 fp=0xc0015016d8 sp=0xc001501690 pc=0x5fe585                                                                                                                                                         
net.(*TCPConn).Read(0xc001501770?, {0xc0009f4b00?, 0xc002f14018?, 0x18?})                                                                                                                                                                      
    <autogenerated>:1 +0x25 fp=0xc001501708 sp=0xc0015016d8 pc=0x60f8c5                                                                                                                                                                        
crypto/tls.(*atLeastReader).Read(0xc002f14018, {0xc0009f4b00?, 0xc002f14018?, 0x0?})                                                                                                                                                           
    /usr/local/go/src/crypto/tls/conn.go:805 +0x3b fp=0xc001501750 sp=0xc001501708 pc=0x6567fb                                                                                                                                                 
bytes.(*Buffer).ReadFrom(0xc002aecd28, {0x3ce03a0, 0xc002f14018})                                                                                                                                                                              
    /usr/local/go/src/bytes/buffer.go:211 +0x98 fp=0xc0015017a8 sp=0xc001501750 pc=0x51c9f8                                                                                                                                                    
crypto/tls.(*Conn).readFromUntil(0xc002aeca80, {0x3ce1aa0?, 0xc0007b81e8}, 0x580?)                                                                                                                                                             
    /usr/local/go/src/crypto/tls/conn.go:827 +0xde fp=0xc0015017e8 sp=0xc0015017a8 pc=0x6569de                                                                                                                                                 
crypto/tls.(*Conn).readRecordOrCCS(0xc002aeca80, 0x0)                                                                                                                                                                                          
    /usr/local/go/src/crypto/tls/conn.go:625 +0x250 fp=0xc001501b88 sp=0xc0015017e8 pc=0x653fb0                                                                                                                                                
crypto/tls.(*Conn).readRecord(...)                                                                                                                                                                                                             
    /usr/local/go/src/crypto/tls/conn.go:587                                                                                                                                                                                                   
crypto/tls.(*Conn).Read(0xc002aeca80, {0xc000980000, 0x8000, 0x1060100000000?})                                                                                                                                                                
    /usr/local/go/src/crypto/tls/conn.go:1369 +0x158 fp=0xc001501bf8 sp=0xc001501b88 pc=0x65a278                                                                                                                                               
github.com/soheilhy/cmux.(*bufferedReader).Read(0xc00017c010, {0xc000980000, 0xc001501c90?, 0x8000})                                                                                                                                           
    /go/pkg/mod/github.com/soheilhy/[email protected]/buffer.go:53 +0x12f fp=0xc001501c48 sp=0xc001501bf8 pc=0x1f8812f                                                                                                                               
github.com/soheilhy/cmux.(*MuxConn).Read(0x0?, {0xc000980000?, 0xc001501ca0?, 0x45d10d?})                                                                                                                                                      
    /go/pkg/mod/github.com/soheilhy/[email protected]/cmux.go:297 +0x1e fp=0xc001501c78 sp=0xc001501c48 pc=0x1f8965e                                                                                                                                 
bufio.(*Reader).Read(0xc0035ff980, {0xc0006da4a0, 0x9, 0xc1921ef224b271f3?})                                                                                                                                                                   
    /usr/local/go/src/bufio/bufio.go:244 +0x197 fp=0xc001501cb0 sp=0xc001501c78 pc=0x696c77                                                                                                                                                    
io.ReadAtLeast({0x3ce05c0, 0xc0035ff980}, {0xc0006da4a0, 0x9, 0x9}, 0x9)                                                                                                                                                                       
    /usr/local/go/src/io/io.go:335 +0x90 fp=0xc001501cf8 sp=0xc001501cb0 pc=0x4b9cf0                                                                                                                                                           
io.ReadFull(...)                                                                                                                                                                                                                               
    /usr/local/go/src/io/io.go:354                                                                                                                                                                                                             
golang.org/x/net/http2.readFrameHeader({0xc0006da4a0, 0x9, 0xc003120120?}, {0x3ce05c0?, 0xc0035ff980?})                                                                                                                                        
    /go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:237 +0x65 fp=0xc001501d48 sp=0xc001501cf8 pc=0x779945                                                                                                                                  
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0006da460)                                                                                                                                                                                       
    /go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:498 +0x85 fp=0xc001501df0 sp=0xc001501d48 pc=0x77a085                                                                                                                                  
google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc000d891e0, 0x1?)                                                                                                                                                     
    /go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:636 +0x145 fp=0xc001501f00 sp=0xc001501df0 pc=0xf84325                                                                                                       
google.golang.org/grpc.(*Server).serveStreams(0xc00023e000, {0x3d1cf40?, 0xc000d891e0})                                                                                                                                                        
    /go/pkg/mod/google.golang.org/[email protected]/server.go:979 +0x1c2 fp=0xc001501f80 sp=0xc001501f00 pc=0xfd5702                                                                                                                                
google.golang.org/grpc.(*Server).handleRawConn.func1()                                                                                                                                                                                         
    /go/pkg/mod/google.golang.org/[email protected]/server.go:920 +0x45 fp=0xc001501fe0 sp=0xc001501f80 pc=0xfd4f65                                                                                                                                 
runtime.goexit()                                                                                                                                                                                                                               
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc001501fe8 sp=0xc001501fe0 pc=0x4712e1                                                                                                                                                
created by google.golang.org/grpc.(*Server).handleRawConn in goroutine 656                                                                                                                                                                     
    /go/pkg/mod/google.golang.org/[email protected]/server.go:919 +0x185                                                                                                                                                                            
                                                                                                                                                                                                                                               
goroutine 487 [select]:                                                                                                                                                                                                                        
runtime.gopark(0xc001505f90?, 0x2?, 0xe0?, 0x5d?, 0xc001505f1c?)                                                                                                                                                                               
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc001505db8 sp=0xc001505d98 pc=0x43e26e                                                                                                                                                    
runtime.selectgo(0xc001505f90, 0xc001505f18, 0xc0007c0180?, 0x0, 0xc0031288a0?, 0x1)                                                                                                                                                           
    /usr/local/go/src/runtime/select.go:327 +0x725 fp=0xc001505ed8 sp=0xc001505db8 pc=0x44e6a5                                                                                                                                                 
net/http.(*persistConn).writeLoop(0xc00178c120)                                                                                                                                                                                                
    /usr/local/go/src/net/http/transport.go:2421 +0xe5 fp=0xc001505fc8 sp=0xc001505ed8 pc=0x72d605                                                                                                                                             
net/http.(*Transport).dialConn.func6()                                                                                                                                                                                                         
    /usr/local/go/src/net/http/transport.go:1777 +0x25 fp=0xc001505fe0 sp=0xc001505fc8 pc=0x72a405                                                                                                                                             
runtime.goexit()                                                                                                                                                                                                                               
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc001505fe8 sp=0xc001505fe0 pc=0x4712e1                                                                                                                                                
created by net/http.(*Transport).dialConn in goroutine 517                                                                                                                                                                                     
    /usr/local/go/src/net/http/transport.go:1777 +0x16f1                                                             

                                                                                                                                                                                                                                               
goroutine 486 [IO wait]:                                                                                                                                                                                                                       
runtime.gopark(0xbf97d9ec25bb9557?, 0xb?, 0x0?, 0x0?, 0xd?)                                                                                                                                                                                    
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000ad15c8 sp=0xc000ad15a8 pc=0x43e26e                                                                                                                                                    
runtime.netpollblock(0x4c5158?, 0x407de6?, 0x0?)                                                                                                                                                                                               
    /usr/local/go/src/runtime/netpoll.go:564 +0xf7 fp=0xc000ad1600 sp=0xc000ad15c8 pc=0x436cf7                                                                                                                                                 
internal/poll.runtime_pollWait(0x7fca5da77148, 0x72)                                                                                                                                                                                           
    /usr/local/go/src/runtime/netpoll.go:343 +0x85 fp=0xc000ad1620 sp=0xc000ad1600 pc=0x46b905                                                                                                                                                 
internal/poll.(*pollDesc).wait(0xc0019ac680?, 0xc0009f4000?, 0x0)                                                                                                                                                                              
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000ad1648 sp=0xc000ad1620 pc=0x4e2ec7                                                                                                                                    
internal/poll.(*pollDesc).waitRead(...)                                                                                                                                                                                                        
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:89                                                                                                                                                                                      
internal/poll.(*FD).Read(0xc0019ac680, {0xc0009f4000, 0x580, 0x580})                                                                                                                                                                           
    /usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a fp=0xc000ad16e0 sp=0xc000ad1648 pc=0x4e41ba                                                                                                                                          
net.(*netFD).Read(0xc0019ac680, {0xc0009f4000?, 0xc0009f4005?, 0x3e6?})                                                                                                                                                                        
    /usr/local/go/src/net/fd_posix.go:55 +0x25 fp=0xc000ad1728 sp=0xc000ad16e0 pc=0x5ec9a5                                                                                                                                                     
net.(*conn).Read(0xc0007b8128, {0xc0009f4000?, 0xc000295a01?, 0xc002aec538?})                                                                                                                                                                  
    /usr/local/go/src/net/net.go:179 +0x45 fp=0xc000ad1770 sp=0xc000ad1728 pc=0x5fe585                                                                                                                                                         
net.(*TCPConn).Read(0xc000ad1808?, {0xc0009f4000?, 0xc002f140d8?, 0x18?})                                                                                                                                                                      
    <autogenerated>:1 +0x25 fp=0xc000ad17a0 sp=0xc000ad1770 pc=0x60f8c5                                                                                                                                                                        
crypto/tls.(*atLeastReader).Read(0xc002f140d8, {0xc0009f4000?, 0xc002f140d8?, 0x0?})                                                                                                                                                           
    /usr/local/go/src/crypto/tls/conn.go:805 +0x3b fp=0xc000ad17e8 sp=0xc000ad17a0 pc=0x6567fb                                                                                                                                                 
bytes.(*Buffer).ReadFrom(0xc002aec628, {0x3ce03a0, 0xc002f140d8})                                                                                                                                                                              
    /usr/local/go/src/bytes/buffer.go:211 +0x98 fp=0xc000ad1840 sp=0xc000ad17e8 pc=0x51c9f8                                                                                                                                                    
crypto/tls.(*Conn).readFromUntil(0xc002aec380, {0x3ce1aa0?, 0xc0007b8128}, 0x580?)                                                                                                                                                             
    /usr/local/go/src/crypto/tls/conn.go:827 +0xde fp=0xc000ad1880 sp=0xc000ad1840 pc=0x6569de                                                                                                                                                 
crypto/tls.(*Conn).readRecordOrCCS(0xc002aec380, 0x0)                                                                                                                                                                                          
    /usr/local/go/src/crypto/tls/conn.go:625 +0x250 fp=0xc000ad1c20 sp=0xc000ad1880 pc=0x653fb0                                                                                                                                                
crypto/tls.(*Conn).readRecord(...)                                                                                                                                                                                                             
    /usr/local/go/src/crypto/tls/conn.go:587                                                                                                                                                                                                   
crypto/tls.(*Conn).Read(0xc002aec380, {0xc00098a000, 0x1000, 0xd?})                                                                                                                                                                            
    /usr/local/go/src/crypto/tls/conn.go:1369 +0x158 fp=0xc000ad1c90 sp=0xc000ad1c20 pc=0x65a278                                                                                                                                               
net/http.(*persistConn).Read(0xc00178c120, {0xc00098a000?, 0xc000868540?, 0xc000ad1d38?})                                                                                                                                                      
    /usr/local/go/src/net/http/transport.go:1954 +0x4a fp=0xc000ad1cf0 sp=0xc000ad1c90 pc=0x72ae4a                                                                                                                                             
bufio.(*Reader).fill(0xc0013c1380)                                                                                                                                                                                                             
    /usr/local/go/src/bufio/bufio.go:113 +0x103 fp=0xc000ad1d28 sp=0xc000ad1cf0 pc=0x696743                                                                                                                                                    
bufio.(*Reader).Peek(0xc0013c1380, 0x1)                                                                                                                                                                                                        
    /usr/local/go/src/bufio/bufio.go:151 +0x53 fp=0xc000ad1d48 sp=0xc000ad1d28 pc=0x696873                                                                                                                                                     
net/http.(*persistConn).readLoop(0xc00178c120)                                                                                                                                                                                                 
    /usr/local/go/src/net/http/transport.go:2118 +0x1b9 fp=0xc000ad1fc8 sp=0xc000ad1d48 pc=0x72bc39                                                                                                                                            
net/http.(*Transport).dialConn.func5()                                                                                                                                                                                                         
    /usr/local/go/src/net/http/transport.go:1776 +0x25 fp=0xc000ad1fe0 sp=0xc000ad1fc8 pc=0x72a465                                                                                                                                             
runtime.goexit()                                                                                                                                                                                                                               
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000ad1fe8 sp=0xc000ad1fe0 pc=0x4712e1                                                                                                                                                
created by net/http.(*Transport).dialConn in goroutine 517                                                                                                                                                                                     
    /usr/local/go/src/net/http/transport.go:1776 +0x169f    

@Joibel Joibel added the P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority label Jun 11, 2024
@Joibel Joibel self-assigned this Jun 11, 2024
Joibel added a commit to Joibel/argo-workflows that referenced this issue Jun 11, 2024
[zombiezen/go-sqlite](https://github.com/zombiezen/go-sqlite/blob/main/doc.go#L32) is not
thread safe when used through a single connection. The current code is
provably racing (run the server with `-race` and a few workflows being
run) and it will tell you this if you `argo list` via the server a few
times.

This change doesn't attempt to move to a multiple connection model,
it's a minimal change to stop the server crashing all the time, by
mutexing the use of the sql connection.

Fixes argoproj#13154 and argoproj#13140

Signed-off-by: Alan Clucas <[email protected]>
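
The approach the commit message describes, serializing every use of one shared connection behind a mutex, can be illustrated with a short Go sketch. This is not the code from the fix. `unsafeConn`, `guardedStore`, and `runConcurrent` are hypothetical names, and the in-memory slice only stands in for a single sqlite connection that is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sync"
)

// unsafeConn is a hypothetical stand-in for one database connection that
// must not be used by two goroutines at the same time.
type unsafeConn struct {
	rows []string
}

func (c *unsafeConn) insert(name string) {
	c.rows = append(c.rows, name)
}

func (c *unsafeConn) list() []string {
	out := make([]string, len(c.rows))
	copy(out, c.rows)
	return out
}

// guardedStore serializes every use of the single connection with a mutex,
// so concurrent callers take turns instead of racing.
type guardedStore struct {
	mu   sync.Mutex
	conn *unsafeConn
}

func (s *guardedStore) Add(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.conn.insert(name)
}

func (s *guardedStore) List() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.conn.list()
}

// runConcurrent starts several goroutines that add and list entries at the
// same time, then returns the number of entries stored.
func runConcurrent(writers, perWriter int) int {
	store := &guardedStore{conn: &unsafeConn{}}
	var wg sync.WaitGroup
	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWriter; i++ {
				store.Add(fmt.Sprintf("wf-%d-%d", w, i))
				_ = store.List() // reads take the same lock as writes
			}
		}(w)
	}
	wg.Wait()
	return len(store.List())
}

func main() {
	fmt.Println(runConcurrent(8, 100))
}
```

Removing the lock calls and running the sketch with `go run -race` should make the race detector report the unsynchronized access, which is the same check the commit message suggests running against the server. A single mutex trades throughput for safety, which matches the commit's stated goal of a minimal change rather than a move to multiple connections.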
@argoproj argoproj locked as resolved and limited conversation to collaborators Sep 21, 2024
3 participants