Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segfault running notcurses-info on alpine drone builder #2828

Closed
dankamongmen opened this issue Jan 2, 2025 · 63 comments
Closed

segfault running notcurses-info on alpine drone builder #2828

dankamongmen opened this issue Jan 2, 2025 · 63 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@dankamongmen
Copy link
Owner

See e.g. https://drone.dsscaw.com:4443/dankamongmen/notcurses/10721/4/2. I've brought all our autobuilders back up to speed, but we're not successfully running tests on alpine edge. We're doing fine on the other runners.

@dankamongmen dankamongmen added the bug Something isn't working label Jan 2, 2025
@dankamongmen dankamongmen added this to the 3.1.0 milestone Jan 2, 2025
@dankamongmen dankamongmen self-assigned this Jan 2, 2025
@dankamongmen
Copy link
Owner Author

we also get a warning regarding ffmpeg on alpine:

[ 26%] Building C object CMakeFiles/notcurses-static.dir/src/media/ffmpeg.c.o
/home/dank/git/src/media/ffmpeg.c: In function 'ffmpeg_pkt_duration':
/home/dank/git/src/media/ffmpeg.c:45:7: warning: 'pkt_duration' is deprecated [-Wdeprecated-declarations]
   45 |       return frame->pkt_duration;
      |       ^~~~~~
In file included from /home/dank/git/src/media/ffmpeg.c:4:
/usr/include/libavutil/frame.h:700:13: note: declared here
  700 |     int64_t pkt_duration;
      |             ^~~~~~~~~~~~

@dankamongmen
Copy link
Owner Author

that warning suggests a pretty old version of libav (<59). is it possible that our runner is simply out of date?

uint64_t ffmpeg_pkt_duration(const AVFrame* frame){                                                                                 
#if LIBAVUTIL_VERSION_MAJOR < 59                                                                                                    
      return frame->pkt_duration;                                                                                                   
#else                                                                                                                               
      return frame->duration;                                                                                                       
#endif                                                                                                                              
}     

btw apparently we ought be using av_frame_get_pkt_duration() to get this, according to the docs at https://ffmpeg.org/doxygen/2.7/structAVFrame.html ..but that identifier doesn't appear to actually exist in current ffmpeg, lol.

@dankamongmen
Copy link
Owner Author

the versions installed look reasonable...

ffmpeg-dev-6.1.2-r1 x86_64 {ffmpeg} (GPL-2.0-or-later AND LGPL-2.1-or-later) [installed]
ffmpeg-libavcodec-6.1.2-r1 x86_64 {ffmpeg} (GPL-2.0-or-later AND LGPL-2.1-or-later) [installed]
ffmpeg-libavdevice-6.1.2-r1 x86_64 {ffmpeg} (GPL-2.0-or-later AND LGPL-2.1-or-later) [installed]
ffmpeg-libavfilter-6.1.2-r1 x86_64 {ffmpeg} (GPL-2.0-or-later AND LGPL-2.1-or-later) [installed]
ffmpeg-libavformat-6.1.2-r1 x86_64 {ffmpeg} (GPL-2.0-or-later AND LGPL-2.1-or-later) [installed]
ffmpeg-libavutil-6.1.2-r1 x86_64 {ffmpeg} (GPL-2.0-or-later AND LGPL-2.1-or-later) [installed]
ffmpeg-libpostproc-6.1.2-r1 x86_64 {ffmpeg} (GPL-2.0-or-later AND LGPL-2.1-or-later) [installed]
ffmpeg-libswresample-6.1.2-r1 x86_64 {ffmpeg} (GPL-2.0-or-later AND LGPL-2.1-or-later) [installed]
ffmpeg-libswscale-6.1.2-r1 x86_64 {ffmpeg} (GPL-2.0-or-later AND LGPL-2.1-or-later) [installed]

@dankamongmen
Copy link
Owner Author

doh i was looking at the 2.7 ffmpeg docs, idiot

@dankamongmen
Copy link
Owner Author

dankamongmen commented Jan 2, 2025

/usr/include/libavutil/version.h:#define LIBAVUTIL_VERSION_MAJOR  58
/usr/include/libavutil/version.h:#define LIBAVUTIL_VERSION_MINOR  29
/usr/include/libavutil/version.h:#define LIBAVUTIL_VERSION_MICRO 100

a surprisingly weird libavutil version in the Alpine headers!

@dankamongmen
Copy link
Owner Author

meanwhile, on my /usr/local:

/usr/local/include/libavutil/version.h:#define LIBAVUTIL_VERSION_MAJOR  59
/usr/local/include/libavutil/version.h:#define LIBAVUTIL_VERSION_MINOR  53
/usr/local/include/libavutil/version.h:#define LIBAVUTIL_VERSION_MICRO 100

@dankamongmen
Copy link
Owner Author

dankamongmen commented Jan 2, 2025

looks like we can just relax the version selector; duration is present and usable in 58. back to the real problem, which i am now reproducing in the Alpine runner...

/home/dank/git/build # ./notcurses-info 
notcurses 3.0.11 on Kitty 0.38.1 (Linux 6.12.1nlb2)
74 rows (19px) 132 cols (10px) 1406x1320 rgb+8 colors
gcc-14.2.0 (LE)
terminfo 6.5.20241006 libdeflate 1.23 GPM n/a
af+ ab+ sum- vpa+ hpa+ sgr0+ op+ fgop+ bgop+ bce+ rect-                         0.3.100
af+ ab+ sum- vpa+ hpa+ sgr0+ op+ fgop+ bgop+ bce+ rect-                         
bold+ ital+ struck+ ucurl+ uline+ u7+ ccc- rgb+ el+                             
utf8+ 2x1+ 2x2+ 3x2+ 4x2- 4x2+ img+ vid+ indn+ gpm- kbd+                        
default fg 0xffffff default bg 0x080808 pmouse+                                 
1st gen rgba pixel animation support                   🯁🯂🯃https://notcurses.com 
 ▘▝▀▖▌▞▛▗▚▐▜▄▙▟█⎧ 🬀🬁🬂🬃🬄🬅🬆🬇🬈🬊🬋🬌🬍🬎🬏🬐🬑🬒🬓▌🬔🬕🬖🬗🬘🬙🬚🬛🬜🬝⎫♠♥🯰🯱🯲🯳🯴🯵🯶🯷🯸🯹⅗⅘⅙⅚⅛⎧▕▏⎫┌╥─╥─╥┐🭩⎛⎞
╲╿╱ ◨◧ ◪◩ ◖◗ ⫷⫸ ⎩🬟🬠🬡🬢🬣🬤🬥🬦🬧▐🬨🬩🬪🬫🬬🬭🬮🬯🬰🬱🬲🬳🬴🬵🬶🬷🬸🬹🬺🬻█⎭♦♣¼½¾⅐⅑⅒⅓⅔⅕⅖⅜⅝⅞⅟↉⎪🮇▎⎪├╜╓╫╖╙┤🭫⎜⎟
╾╳╼ ◲◱ ◶◵ 🮣🮠 🮤🮥◜◝ ◿◺ 🮞🮟 ◢◣ ┌┐─ ┏┓━ ╭╮─ ╔╗═ 🭽🭾▁♟♜♞⩘▵△▹▷▿▽◃◁⭡⭣⭠⭢⭧⭩⭦⭨⎪🮈▍⎪├─╨╫╨─┤┇⎜⎟
╱╽╲ ◳◰ ◷◴ 🮡🮢 🮦🮧◟◞ ◹◸ 🮝🮜 ◥◤ └┘│ ┗┛┃ ╰╯│ ╚╝║ 🭼🭿🭵♝♛♚⩗▴⏶⯅▲▸⏵⯈▶▾⏷⯆▼◂⏴⯇◀⎪▐▌⎪╞═╤╬╤═╡┋⎜⎟
⎡⠀⠁⠈⠉⠂⠃⠊⠋⠐⠑⠘⠙⠒⠓⠚⠛⠄⠅⠌⠍⠆⠇⠎⠏⠔⠕⠜⠝⠖⠗⠞⠟⠠⠡⠨⠩⠢⠣⠪⠫⠰⠱⠸⠹⠲⠳⠺⠻⠤⠥⠬⠭⠦⠧⠮⠯⠴⠵⠼⠽⠶⠷⠾⠿⎤⎨🮉▋⎬╞╕╘╬╛╒╡┊⎜⎟
⎢⡀⡁⡈⡉⡂⡃⡊⡋⡐⡑⡘⡙⡒⡓⡚⡛⡄⡅⡌⡍⡆⡇⡎⡏⡔⡕⡜⡝⡖⡗⡞⡟⡠⡡⡨⡩⡢⡣⡪⡫⡰⡱⡸⡹⡲⡳⡺⡻⡤⡥⡬⡭⡦⡧⡮⡯⡴⡵⡼⡽⡶⡷⡾⡿⎥⎪🮊▊⎪└┴─╨─┴┘╏⎝⎠
⎢⢀⢁⢈⢉⢂⢃⢊⢋⢐⢑⢘⢙⢒⢓⢚⢛⢄⢅⢌⢍⢆⢇⢎⢏⢔⢕⢜⢝⢖⢗⢞⢟⢠⢡⢨⢩⢢⢣⢪⢫⢰⢱⢸⢹⢲⢳⢺⢻⢤⢥⢬⢭⢦⢧⢮⢯⢴⢵⢼⢽⢶⢷⢾⢿⎥⎪🮋▉⎪╭──╮⟬⟭╔╗≶≷
⎣⣀⣁⣈⣉⣂⣃⣊⣋⣐⣑⣘⣙⣒⣓⣚⣛⣄⣅⣌⣍⣆⣇⣎⣏⣔⣕⣜⣝⣖⣗⣞⣟⣠⣡⣨⣩⣢⣣⣪⣫⣰⣱⣸⣹⣲⣳⣺⣻⣤⣥⣬⣭⣦⣧⣮⣯⣴⣵⣼⣽⣶⣷⣾⣿⎦⎪██⎪│╭╮│╔═╝║⊆⊇
 ▔🭶🭷🭸🭹🭺🭻▁ 🭁🭌 🭂🭍 🭃🭎 🭄🭏 🭅🭐 🭆🭑 🭇🬼 🭈🬽 🭉🬾 🭊🬿 🭋🭀 ₀₁₂₃₄₅₆₇₈₉ ⎛ ▁▂▃▄▅▆▇█🭫⎞⎪🭨🭪⎪╰╯││║╔═╝⊴⊵
 ▏🭰🭱🭲🭳🭴🭵▕ 🭒🭝 🭓🭞 🭔🭟 🭕🭠 🭖🭡 🭧🭜 🭢🭗 🭣🭘 🭤🭙 🭥🭚 🭦🭛 ⁰¹²³⁴⁵⁶⁷⁸⁹ ⎝ ▔🮂🮃▀🮄🮅🮆█🭩⎠⎩🭪🭨⎭⧒⧑╰╯╚╝❨❩⟃⟄                     
 
Segmentation fault (core dumped)
/home/dank/git/build # 

@dankamongmen
Copy link
Owner Author

note we're reproducing here with Kitty, suggesting this is not term-dependent

@dankamongmen
Copy link
Owner Author

dankamongmen commented Jan 2, 2025

looks like all binaries are segfaulting on alpine, ugh

@dankamongmen
Copy link
Owner Author

Thread 2 "notcurses-info" received signal SIG33, Real-time event 33.
[Switching to LWP 2421]
__cp_end () at src/thread/x86_64/syscall_cp.s:29
warning: 29	src/thread/x86_64/syscall_cp.s: No such file or directory
(gdb) bt
#0  __cp_end () at src/thread/x86_64/syscall_cp.s:29
#1  0x00007efff68e3cb5 in __syscall_cp_c (nr=271, u=<optimized out>, v=<optimized out>, w=<optimized out>, x=<optimized out>, 
    y=<optimized out>, z=0) at src/thread/pthread_cancel.c:33
#2  0x00007efff68dbaf7 in ppoll (fds=fds@entry=0x7effed5ff520, n=n@entry=1, to=to@entry=0x0, mask=mask@entry=0x7effed5ff530)
    at src/select/ppoll.c:24
#3  0x00007efff68170b1 in block_on_input (ictx=<optimized out>, rtfd=<synthetic pointer>, rifd=<synthetic pointer>)
    at /home/dank/git/src/lib/in.c:2551
#4  read_inputs_nblock (ictx=<optimized out>) at /home/dank/git/src/lib/in.c:2589
#5  input_thread (vmarshall=0x7effed871ab0) at /home/dank/git/src/lib/in.c:2620
#6  0x00007efff68e49d2 in start (p=0x7effed5ff620) at src/thread/pthread_create.c:207
#7  0x00007efff68e6314 in __clone () at src/thread/x86_64/clone.s:22
Backtrace stopped: frame did not save the PC
(gdb) 

@dankamongmen
Copy link
Owner Author

same in zalgo:

Thread 2 "zalgo" received signal SIG33, Real-time event 33.
[Switching to LWP 2435]
__cp_end () at src/thread/x86_64/syscall_cp.s:29
warning: 29	src/thread/x86_64/syscall_cp.s: No such file or directory
(gdb) bt
#0  __cp_end () at src/thread/x86_64/syscall_cp.s:29
#1  0x00007f935e52ecb5 in __syscall_cp_c (nr=271, u=<optimized out>, v=<optimized out>, w=<optimized out>, x=<optimized out>, 
    y=<optimized out>, z=0) at src/thread/pthread_cancel.c:33
#2  0x00007f935e526af7 in ppoll (fds=fds@entry=0x7f935e1bda00, n=n@entry=1, to=to@entry=0x0, mask=mask@entry=0x7f935e1bda10)
    at src/select/ppoll.c:24
#3  0x00007f935e4690b1 in block_on_input (ictx=<optimized out>, rtfd=<synthetic pointer>, rifd=<synthetic pointer>)
    at /home/dank/git/src/lib/in.c:2551
#4  read_inputs_nblock (ictx=<optimized out>) at /home/dank/git/src/lib/in.c:2589
#5  input_thread (vmarshall=0x7f935e1fab30) at /home/dank/git/src/lib/in.c:2620
#6  0x00007f935e52f9d2 in start (p=0x7f935e1bdb00) at src/thread/pthread_create.c:207
#7  0x00007f935e531314 in __clone () at src/thread/x86_64/clone.s:22
Backtrace stopped: frame did not save the PC
(gdb) 

@dankamongmen
Copy link
Owner Author

i wouldn't be shocked if this is due to Musl's implementation of pthread_timedjoin_np()...

so we're getting hit with SIG33 ("real-time event 33", but iirc musl and gllibc have weird differences here, see https://sourceware.org/bugzilla/show_bug.cgi?id=23616) in thread 2 while thread 1 is performing a pthread_timedjoin_np(), then we get hit with the SIGSEGV:

Thread 2 "zalgo" received signal SIG33, Real-time event 33.
[Switching to LWP 2455]
__cp_end () at src/thread/x86_64/syscall_cp.s:29
warning: 29	src/thread/x86_64/syscall_cp.s: No such file or directory
(gdb) thread apply all bt

Thread 2 (LWP 2455 "zalgo"):
#0  __cp_end () at src/thread/x86_64/syscall_cp.s:29
#1  0x00007efd465c8cb5 in __syscall_cp_c (nr=271, u=<optimized out>, v=<optimized out>, w=<optimized out>, x=<optimized out>, y=<optimized out>, z=0) at src/thread/pthread_cancel.c:33
#2  0x00007efd465c0af7 in ppoll (fds=fds@entry=0x7efd46257a00, n=n@entry=1, to=to@entry=0x0, mask=mask@entry=0x7efd46257a10) at src/select/ppoll.c:24
#3  0x00007efd465030b1 in block_on_input (ictx=<optimized out>, rtfd=<synthetic pointer>, rifd=<synthetic pointer>) at /home/dank/git/src/lib/in.c:2551
#4  read_inputs_nblock (ictx=<optimized out>) at /home/dank/git/src/lib/in.c:2589
#5  input_thread (vmarshall=0x7efd46294b30) at /home/dank/git/src/lib/in.c:2620
#6  0x00007efd465c99d2 in start (p=0x7efd46257b00) at src/thread/pthread_create.c:207
#7  0x00007efd465cb314 in __clone () at src/thread/x86_64/clone.s:22
Backtrace stopped: frame did not save the PC

Thread 1 (LWP 2454 "zalgo"):
#0  __cp_end () at src/thread/x86_64/syscall_cp.s:29
#1  0x00007efd465c8cb5 in __syscall_cp_c (nr=202, u=<optimized out>, v=<optimized out>, w=<optimized out>, x=<optimized out>, y=<optimized out>, z=0) at src/thread/pthread_cancel.c:33
#2  0x00007efd465c8283 in __futex4_cp (addr=0x7efd46257b70, op=128, val=2, to=<optimized out>) at src/thread/__timedwait.c:24
#3  __timedwait_cp (addr=addr@entry=0x7efd46257b70, val=2, clk=clk@entry=0, at=at@entry=0x0, priv=128, priv@entry=1) at src/thread/__timedwait.c:52
#4  0x00007efd465c9cfd in __pthread_timedjoin_np (t=t@entry=0x7efd46257b38, res=res@entry=0x0, at=at@entry=0x0) at src/thread/pthread_join.c:18
#5  0x00007efd465c9d6b in __pthread_join (t=t@entry=0x7efd46257b38, res=res@entry=0x0) at src/thread/pthread_join.c:30
#6  0x00007efd4650527d in cancel_and_join (name=0x7efd46547473 "input", res=0x0, tid=0x7efd46257b38) at /home/dank/git/src/lib/internal.h:1840
#7  stop_inputlayer (ti=ti@entry=0x7efd46299320) at /home/dank/git/src/lib/in.c:2651
#8  0x00007efd46536a7e in free_terminfo_cache (ti=ti@entry=0x7efd46299320) at /home/dank/git/src/lib/termdesc.c:198
#9  0x00007efd46514f75 in notcurses_stop (nc=nc@entry=0x7efd46299020) at /home/dank/git/src/lib/notcurses.c:1466
#10 0x00005652382cf245 in main () at /home/dank/git/src/poc/zalgo.c:29
(gdb) cont
Continuing.

Thread 2 "zalgo" received signal SIGSEGV, Segmentation fault.
__cp_end () at src/thread/x86_64/syscall_cp.s:29
29	in src/thread/x86_64/syscall_cp.s
(gdb) thread apply all bt

Thread 2 (LWP 2455 "zalgo"):
#0  __cp_end () at src/thread/x86_64/syscall_cp.s:29
#1  0x00007efd465c8cb5 in __syscall_cp_c (nr=271, u=<optimized out>, v=<optimized out>, w=<optimized out>, x=<optimized out>, y=<optimized out>, z=0) at src/thread/pthread_cancel.c:33
#2  0x00007efd465c0af7 in ppoll (fds=fds@entry=0x7efd46257a00, n=n@entry=1, to=to@entry=0x0, mask=mask@entry=0x7efd46257a10) at src/select/ppoll.c:24
#3  0x00007efd465030b1 in block_on_input (ictx=<optimized out>, rtfd=<synthetic pointer>, rifd=<synthetic pointer>) at /home/dank/git/src/lib/in.c:2551
#4  read_inputs_nblock (ictx=<optimized out>) at /home/dank/git/src/lib/in.c:2589
#5  input_thread (vmarshall=0x7efd46294b30) at /home/dank/git/src/lib/in.c:2620
#6  0x00007efd465c99d2 in start (p=0x7efd46257b00) at src/thread/pthread_create.c:207
#7  0x00007efd465cb314 in __clone () at src/thread/x86_64/clone.s:22
Backtrace stopped: frame did not save the PC

Thread 1 (LWP 2454 "zalgo"):
#0  __cp_end () at src/thread/x86_64/syscall_cp.s:29
#1  0x00007efd465c8cb5 in __syscall_cp_c (nr=202, u=<optimized out>, v=<optimized out>, w=<optimized out>, x=<optimized out>, y=<optimized out>, z=0) at src/thread/pthread_cancel.c:33
#2  0x00007efd465c8283 in __futex4_cp (addr=0x7efd46257b70, op=128, val=2, to=<optimized out>) at src/thread/__timedwait.c:24
#3  __timedwait_cp (addr=addr@entry=0x7efd46257b70, val=2, clk=clk@entry=0, at=at@entry=0x0, priv=128, priv@entry=1) at src/thread/__timedwait.c:52
#4  0x00007efd465c9cfd in __pthread_timedjoin_np (t=t@entry=0x7efd46257b38, res=res@entry=0x0, at=at@entry=0x0) at src/thread/pthread_join.c:18
#5  0x00007efd465c9d6b in __pthread_join (t=t@entry=0x7efd46257b38, res=res@entry=0x0) at src/thread/pthread_join.c:30
#6  0x00007efd4650527d in cancel_and_join (name=0x7efd46547473 "input", res=0x0, tid=0x7efd46257b38) at /home/dank/git/src/lib/internal.h:1840
#7  stop_inputlayer (ti=ti@entry=0x7efd46299320) at /home/dank/git/src/lib/in.c:2651
#8  0x00007efd46536a7e in free_terminfo_cache (ti=ti@entry=0x7efd46299320) at /home/dank/git/src/lib/termdesc.c:198
#9  0x00007efd46514f75 in notcurses_stop (nc=nc@entry=0x7efd46299020) at /home/dank/git/src/lib/notcurses.c:1466
#10 0x00005652382cf245 in main () at /home/dank/git/src/poc/zalgo.c:29
(gdb) 

dankamongmen referenced this issue in lsds/musl Jan 2, 2025
like close, pthread_join is a resource-deallocation function which is
also a cancellation point. the intent of masked cancellation mode is
to exempt such functions from failure with ECANCELED.
@dankamongmen
Copy link
Owner Author

long __cancel()                                                                                                                     
{                                                                                                                                   
  pthread_t self = __pthread_self();                                                                                                
  if (self->canceldisable == PTHREAD_CANCEL_ENABLE || self->cancelasync)                                                            
    pthread_exit(PTHREAD_CANCELED);                                                                                                 
  self->canceldisable = PTHREAD_CANCEL_DISABLE;                                                                                     
  return -ECANCELED;                                                                                                                
}                                                                                                                                   
                                                                                                                                    
long __syscall_cp_asm(volatile void *, syscall_arg_t,                                                                               
                      syscall_arg_t, syscall_arg_t, syscall_arg_t,                                                                  
                      syscall_arg_t, syscall_arg_t, syscall_arg_t);                                                                 
                                                                                                                                    
long __syscall_cp_c(syscall_arg_t nr,                                                                                               
                    syscall_arg_t u, syscall_arg_t v, syscall_arg_t w,                                                              
                    syscall_arg_t x, syscall_arg_t y, syscall_arg_t z)                                                              
{                                                                                                                                   
  pthread_t self;                                                                                                                   
  long r;                                                                                                                           
  int st;                                                                                                                           
                                                                                                                                    
  if ((st=(self=__pthread_self())->canceldisable)                                                                                   
      && (st==PTHREAD_CANCEL_DISABLE || nr==SYS_close))                                                                             
    return __syscall(nr, u, v, w, x, y, z);                                                                                         
                                                                                                                                    
  r = __syscall_cp_asm(&self->cancel, nr, u, v, w, x, y, z);                                                                        
  if (r==-EINTR && nr!=SYS_close && self->cancel &&                                                                                 
      self->canceldisable != PTHREAD_CANCEL_DISABLE)                                                                                
    r = __cancel();                                                                                                                 
  return r;                                                                                                                         
}

__syscall_cp_asm:                                                                                                                   
                                                                                                                                    
__cp_begin:                                                                                                                         
  mov (%rdi),%eax                                                                                                                   
  test %eax,%eax                                                                                                                    
  jnz __cp_cancel                                                                                                                   
  mov %rdi,%r11                                                                                                                     
  mov %rsi,%rax                                                                                                                     
  mov %rdx,%rdi                                                                                                                     
  mov %rcx,%rsi                                                                                                                     
  mov %r8,%rdx                                                                                                                      
  mov %r9,%r10                                                                                                                      
  mov 8(%rsp),%r8                                                                                                                   
  mov 16(%rsp),%r9                                                                                                                  
  mov %r11,8(%rsp)                                                                                                                  
  syscall                                                                                                                           
__cp_end:                                                                                                                           
  ret                                                                                                                               
__cp_cancel:                                                                                                                        
  jmp __cancel                                                                                                                      
~                 

@dankamongmen
Copy link
Owner Author

hrmmm, how are we getting from ppoll() to pthread_cancel() in thread 2 (the segfaulting thread)? why would the thread being canceled be calling pthread_cancel()? or is that just a cleanup handler? both threads seem to be at pthread_cancel.c:33...

@dankamongmen
Copy link
Owner Author

Azure/azure-sphere-gallery#131 hrrrrrmrmm

@richfelker
Copy link

I'm guessing you just have a stack overflow. The segfault at the syscall exit is what I'd expect if the signal delivery event (SIGCANCEL) didn't have space to allocate a stack frame.

@dankamongmen
Copy link
Owner Author

verified that self->cancel is 1 for thread 2 and 0 for thread 1, as expected. so self seems valid here.

@dankamongmen
Copy link
Owner Author

I'm guessing you just have a stack overflow. The segfault at the syscall exit is what I'd expect if the signal delivery event (SIGCANCEL) didn't have space to allocate a stack frame.

hey, thanks for taking a look @richfelker ! btw i figured out the PTHREAD_CANCEL_MASKED mystery. thanks for your continued good work!

@dankamongmen
Copy link
Owner Author

and stack overflow sounds very much like a musl vs glibc difference, which is what we're seeing here -- let's proceed under the assumption @richfelker is correct (generally a safe assumption). what big allocations are we making on the input thread?

@dankamongmen
Copy link
Owner Author

oooh i notice we're calling setup_alt_sig_stack(); in input_thread(), probably worth keeping in mind...

@dankamongmen
Copy link
Owner Author

commenting out setup_alt_sig_stack() results in our binaries running to successful completion on alpine. alright! let's see what's going on there (gross things surely; it's UNIX signals).

@dankamongmen
Copy link
Owner Author

// the alternate signal stack is a thread property; any other threads we                                                            
// create ought go ahead and install the same alternate signal stack.                                                               
void setup_alt_sig_stack(void){                                                                                                     
  pthread_mutex_lock(&lock);                                                                                                        
  if(alt_signal_stack.ss_sp){                                                                                                       
    sigaltstack(&alt_signal_stack, NULL);                                                                                           
  }                                                                                                                                 
  pthread_mutex_unlock(&lock);                                                                                                      
}           

this smells like a questionable assumption...

@dankamongmen
Copy link
Owner Author

possibly relevant: https://www.openwall.com/lists/musl/2019/03/05/2

@dankamongmen
Copy link
Owner Author

fwiw i do not see any big allocations on this stack. so that pretty much cements this as altstack-related bad magics imho. proceeding under that assumption.

@dankamongmen
Copy link
Owner Author

@richfelker you might be interested in this alternative (pun unintended) path to the error we're hitting. investigating...

@dankamongmen
Copy link
Owner Author

note that this is happening after notcurses_stop_minimal() has returned:

reset_term_palette:82:restoring palette via xtpopcolors
notcurses_stop_minimal:174:restored terminal, returning 0
notcurses_drop_planes:1432:we have some planes
ncpile_drop:1413:killing plane 0x7f6c85007e40, next is 0
notcurses_drop_planes:1440:all planes dropped
stop_inputlayer:2650:tearing down input thread
Segmentation fault (core dumped)

@dankamongmen
Copy link
Owner Author

hrmm this is happening after we've called drop_signals(), as well. so the alternate stack ought not even be in play. yet the problem goes away if we don't call setup_alt_signal_stack()...hrmmm.

@dankamongmen
Copy link
Owner Author

if i comment out the sigaltstack() undo in drop_signals(), we run to completion!

@dankamongmen
Copy link
Owner Author

i can leave the sigaltstack(SS_DISABLE) in, and just not free() it, and things work!

@dankamongmen
Copy link
Owner Author

so we hit this immediately after seeing the SIG33 used by musl (similarly to glibc) for thread stuff. and we die coming out of the ppoll(). we do not die if we do not free() the alternate stack, which ought have been disabled by our (successful) call to sigaltstack() prior to this. which leads me to believe we're handling SIG33 on the (dead) alternate stack. butttttttttttt...the sigaltstack() wrapper in both glibc and musl is dirt-simple.

the _NSIG/8 in ppoll.c is weird to me. sizeof(sigset_t) is 128. so if this is supposed to be bytes (which i believe it is), it ought be 128, i'd think. if it's bytes but only the used signal mask size (as indicated by _NSIGS), it seems it ought be _NSIGS / 8 + !!(_NSIGS % 8). given that _NSIGS is 65, this seems like it could be relevant (perhaps thread-management signals are being left out...?). but that still shouldn't impact our case.

but everything turns on that free().

musl definitely doesn't seem to be stashing ss_sp anywhere. like i said, the wrapper is two checks and then the syscall.

let me do an instruction-level step following receipt of signal 33...

@dankamongmen
Copy link
Owner Author

hrmm nexti following signal 33 goes directly to SIGSEGV ..

oh shit! alternate stack is a per-thread deal, but we're only disabling it in the thread that calls drop_signals()! duh! that's got to be it! argh! yeah! so on the other thread, we've still got it registered.....arguhrgrhrghrgrhgrhrghrgrhgr. and we see signal 33 on the other thread on musl only. arghrhrrhgrhgrhrgrhgr.

definitely our bug.

best way to handle this....? obviously a distinct alternate signal stack per thread would resolve it, but then we'd want each thread to clean up, which would be a pain. we don't want the fatal signal handlers running once session destruction has begun, though.

and there's still the question of why thread 2 is handling signal 33 on its alternate signal stack at all, when SA_ONSTACK was not used in conjunction with said signal (as far as i'm aware).

@dankamongmen
Copy link
Owner Author

pthread_cancel() in musl calls init_cancellation(), which installs signal handlers for _NSIG/8 signals of a sigset_t. they are all initialized with SA_ONSTACK. signal actions are process-wide, and thus signal 33 now picks up SA_ONSTACK, even for thread 2. thread 2 has not disabled its alternate stack, but that stack has been free()d by thread 1 (after disabling its own use of said alternate stack). thread 2 never asked to handle signal 33 on its alternate signal stack, but thread 1 has spoken for it (this is musl-exclusive). signal 33 hits and boom, down we go.

note that we kill the SA_ONSTACK (again, process-wide) signal handlers in drop_signal(), so were we to get a SIGSEGV or what not during shutdown like this, we wouldn't run our handler at all (and indeed that's exactly what we see).

so you could make the argument that we oughtn't be free()ing our alternate signal stack while some threads still have it enabled. butttttttttttttttttttt if our thread didn't ask to handle the signal using SA_ONSTACK, i don't think it ought expect the alternate signal stack to be used for the signal. technically, if i'd very carefully configured my stack size leaving no margin, and i died in the signal 33 handler code, my configuration would be invalidated, no (you could argue i need to account for that)?

i feel SA_ONSTACK ought not be used by init_cancellation(), @richfelker . i can work around this a number of ways, but what right does musl have to stomp on my alternate stack with its wayward signals?

@richfelker
Copy link

The effect of sigaltstack is thread local. If you don't want to have an alt stack for a particular thread, don't call it there. If you do register it there, you can't free the memory until you unregister it.

I suspect you have memory corruption, possibly exploitable, on glibc due to UAF. On musl mallocng the memory was unmapped causing the fault.

As for why cancellation happens on the alt stack if installed, https://git.musl-libc.org/cgit/musl/commit?id=2e5fff43dd7fc808197744c67cca7908ac19bb4f

@dankamongmen
Copy link
Owner Author

no, did you read the last comment?

@dankamongmen
Copy link
Owner Author

#2828 (comment) <--- i'm mailing the list now, it's definitely figured out. you might not want to apply my fix, though, which is not unreasonable.

@richfelker
Copy link

See the linked commit message. POSIX specifically says "available to the implementation".

@dankamongmen
Copy link
Owner Author

See the linked commit message. POSIX specifically says "available to the implementation".

fair enough. does it cost musl anything to drop the SA_ONSTACK, though? i.e. is it serving a purpose? there's a law of least astonishment argument to be made here; i'm not claiming you're violating posix.

@dankamongmen
Copy link
Owner Author

either way, i'll work around it here. thanks for your helpful comments through the course of this debugging session!

@dankamongmen
Copy link
Owner Author

hrmm, can't tell whether https://www.openwall.com/lists/ stopped listing mail after 2024-12-31 or no one has sent mail since then or my mail was silently rejected; waiting on subscription confirmation mail for ten minutes now...ugh =]

@richfelker
Copy link

See the linked commit message. POSIX specifically says "available to the implementation".

fair enough. does it cost musl anything to drop the SA_ONSTACK, though? i.e. is it serving a purpose? there's a law of least astonishment argument to be made here; i'm not claiming you're violating posix.

Again, see the linked commit. This was a change specifically requested for a long time by Go and possibly other language projects, where the main stack may be very small because of coroutine-type stuff, and where they rightly feel entitled to use small stacks because they don't intend for any signals to be handled. Having the implementation-internal signals run on the alt stack makes it so they don't crash with stack overflow from what's an implementation-detail of how cancellation and MT setuid etc have to be implemented on Linux.

@dankamongmen
Copy link
Owner Author

so i'm thinking the easiest way to fix this is to disable the fatal signal handlers in drop_signals() where we are, but only restore the original (client) alternate signal stack at the very end of notcurses_stop(). that way we needn't do any per-thread work, and only need to add one line to notcurses_stop().

@dankamongmen
Copy link
Owner Author

Again, see the linked commit. This was a change specifically requested for a long time by Go and possibly other language projects, where the main stack may be very small because of coroutine-type stuff, and where they rightly feel entitled to use small stacks because they don't intend for any signals to be handled. Having the implementation-internal signals run on the alt stack makes it so they don't crash with stack overflow from what's an implementation-detail of how cancellation and MT setuid etc have to be implemented on Linux.

got it, thanks for the explanation! i'd missed the commit link; i thought you were linking to POSIX for some reason. well, i can work around this easily enough, but IMHO it warrants a mention on https://wiki.musl-libc.org/functional-differences-from-glibc.html.

@dankamongmen
Copy link
Owner Author

...and the alpine runner is now green, excellent

@richfelker
Copy link

Looks good. FWIW I think glibc also runs the cancellation signal handler on the alt stack, and only didn't crash because the memory was still mapped.

@dankamongmen
Copy link
Owner Author

Looks good. FWIW I think glibc also runs the cancellation signal handler on the alt stack, and only didn't crash because the memory was still mapped.

i actually don't see a signal when running in gdb on libc, and was wondering if they had changed the implementation since the early days of nptl. cancellation being deferred to explicit points seems to imply a signal-based implementation isn't strictly necessary.

@dankamongmen
Copy link
Owner Author

nptl(7) certainly still mentions use of signals 32 and 33, and ppoll(2) mentions explicit handling of NPTL signals. (feel no obligation to answer/investigate this; i'm not disagreeing with you, just musing aloud).

@richfelker
Copy link

Signal is needed to handle the case where the cancellation request arrives after the flag is checked but before the cancellable syscall happens or while the cancellable syscall is blocked. A clever implementation can probably avoid generating signals except when the target thread is in this state or racing to possibly be in this state; maybe glibc is doing that.

@dankamongmen
Copy link
Owner Author

exactly my thoughts

@dankamongmen
Copy link
Owner Author

dankamongmen commented Jan 2, 2025

i will look into it in thirty minutes or so for completion; in the middle of a command performance for @jart at the moment

@fweimer-rh
Copy link

fweimer-rh commented Jan 2, 2025

Again, see the linked commit. This was a change specifically requested for a long time by Go and possibly other language projects, where the main stack may be very small because of coroutine-type stuff, and where they rightly feel entitled to use small stacks because they don't intend for any signals to be handled. Having the implementation-internal signals run on the alt stack makes it so they don't crash with stack overflow from what's an implementation-detail of how cancellation and MT setuid etc have to be implemented on Linux.

@richfelker I think Go needs this for SIGSYNCCALL only (or SIGSETXID in glibc terms). SIGCANCEL is more predictable because it's only applied to specifically targeted threads. The impacted run times generally do not support canceling their managed threads with pthread_cancel. Even if they do, we try to avoid sending SIGCANCEL to a thread while it is running application code, and we assume that when glibc code is running, there is plenty of stack. (However, this part maybe somewhat racy.)

We haven't seen this issue with glibc because it still does not use SA_ONSTACK for SIGCANCEL. We only made the change for SIGSETXID (previously the Go routine set this behind our back, but this stopped working when we changed the way the signal handler is initialized as part of the libpthread merge).

@richfelker
Copy link

Ah, that makes sense.

@dankamongmen
Copy link
Owner Author

dankamongmen commented Jan 2, 2025

@fweimer-rh thanks for the explanation, always a pleasure to see you!

We haven't seen this issue with glibc because it still does not use SA_ONSTACK for SIGCANCEL.

ought this be read to imply that glibc intends to use SA_ONSTACK in the future for signals 32 and/or 33?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants