Hang in rstrnt-report-result, tests stop #301

Open
ianw opened this issue Jul 24, 2023 · 2 comments

Comments


ianw commented Jul 24, 2023

Hello,

I am trying to debug a repeated watchdog timeout seen on some jobs.

We get into a situation where rstrnt-report-result seems to be stuck and the jobs then time out. This is the state of the process tree on the system:

├─restraintd,1195 --port 8081
  │   ├─10_bash_login,42032 -l /usr/share/restraint/plugins/task_run.d/10_bash_login /usr/share/restraint/plugins/task_run.d/15_beakerlib /usr/share/restraint/plugins/task_run.d/20_unconfined...
  │   │   └─make,42057 run
  │   │       └─sh,42074 -c ( set +o posix; . /usr/bin/rhts_environment.sh; \\\012\011. /usr/share/beakerlib/beakerlib.sh; \\\012\011. runtest.sh )
  │   │           └─sh,42075 -c ( set +o posix; . /usr/bin/rhts_environment.sh; \\\012\011. /usr/share/beakerlib/beakerlib.sh; \\\012\011. runtest.sh )
  │   │               └─rstrnt-report-result,410507 --rhts stress-scheduler-class PASS /tmp/tmp.6PZqM1CmWv 0
  │   ├─{restraintd},1196
  │   └─{restraintd},253100

When I look at the rstrnt-report-result process, it is stuck in a poll and appears to be waiting for data from port 8081:

# strace -p 410507
strace: Process 410507 attached
ppoll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 2, {tv_sec=1858, tv_nsec=895308304}, NULL, 0^Cstrace: Process 410507 detached
 <detached ...>
 
 
# lsof -nP -p 410507
COMMAND      PID USER   FD      TYPE DEVICE SIZE/OFF     NODE NAME
rstrnt-re 410507 root  cwd       DIR  253,0      150 69345240 /mnt/tests/kernel/general/scheduler/sched_stress
rstrnt-re 410507 root  rtd       DIR  253,0      224      128 /
rstrnt-re 410507 root  txt       REG  253,0  3577128   591332 /usr/bin/rstrnt-report-result
rstrnt-re 410507 root  mem       REG  253,0  1939952 33833563 /usr/lib64/libc-2.28.so
rstrnt-re 410507 root  mem       REG  253,0   209216 33833575 /usr/lib64/libpthread-2.28.so
rstrnt-re 410507 root  mem       REG  253,0    70088 33833583 /usr/lib64/libutil-2.28.so
rstrnt-re 410507 root  mem       REG  253,0    75616 33833565 /usr/lib64/libdl-2.28.so
rstrnt-re 410507 root  mem       REG  253,0   137856 33833577 /usr/lib64/libresolv-2.28.so
rstrnt-re 410507 root  mem       REG  253,0    85400 33833579 /usr/lib64/librt-2.28.so
rstrnt-re 410507 root  mem       REG  253,0   737848 33833567 /usr/lib64/libm-2.28.so
rstrnt-re 410507 root  mem       REG  253,0    26998 67415935 /usr/lib64/gconv/gconv-modules.cache
rstrnt-re 410507 root  mem       REG  253,0  1082760 33833556 /usr/lib64/ld-2.28.so
rstrnt-re 410507 root    0r      CHR    1,3      0t0     1027 /dev/null
rstrnt-re 410507 root    1w     FIFO   0,13      0t0   102129 pipe
rstrnt-re 410507 root    2w     FIFO   0,13      0t0   102129 pipe
rstrnt-re 410507 root    3u  a_inode   0,14        0     9342 [eventfd]
rstrnt-re 410507 root    4u     IPv6 475908      0t0      TCP [::1]:56232->[::1]:8081 (ESTABLISHED)
rstrnt-re 410507 root    5u  a_inode   0,14        0     9342 [eventfd]

When I attach gdb to it I can see it sitting in upload_results()

(gdb) bt
#0  0x0000ffffb91db494 in poll () from /lib64/libc.so.6
#1  0x000000000047860c in g_socket_condition_timed_wait ()
#2  0x000000000047956c in g_socket_receive_with_timeout ()
#3  0x0000000000461da4 in g_input_stream_read ()
#4  0x0000000000461da4 in g_input_stream_read ()
#5  0x0000000000438dd8 in soup_filter_input_stream_read_until ()
#6  0x0000000000438fbc in soup_filter_input_stream_read_line ()
#7  0x000000000040e740 in io_read ()
#8  0x000000000040eddc in io_run_until ()
#9  0x000000000040f664 in io_run ()
#10 0x0000000000417998 in soup_session_process_queue_item ()
#11 0x0000000000417f3c in soup_session_real_send_message ()
#12 0x0000000000406d40 in upload_results ()
#13 0x0000000000405f84 in main ()

After examining the environment variables of the process via proc, I can see it's using RECIPE_URL=http://localhost:8081/recipes/14282847 TASKID=163417670
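For anyone reproducing this, here is a minimal sketch of how those variables can be pulled out of /proc. `show_env` is a hypothetical helper name, not something shipped with restraint; it only relies on the fact that /proc/<pid>/environ is a NUL-separated list of NAME=value entries.

```shell
# show_env is a hypothetical helper, not part of restraint.
# It prints the requested variables from a NUL-separated environ file.
show_env() {
    environ_file=$1
    shift
    # Build an alternation such as "RECIPE_URL|TASKID" from the arguments.
    pattern=$(printf '%s|' "$@")
    pattern=${pattern%|}
    # Translate NUL separators to newlines, then keep only the wanted names.
    tr '\0' '\n' < "$environ_file" | grep -E "^(${pattern})="
}

# Usage against the stuck process (needs root or the owner of the process):
# show_env /proc/410507/environ RECIPE_URL TASKID
```

Reading the file this way shows the environment the process was started with, which is what matters for checking which recipe URL and task ID it is reporting against.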

On the restraintd side I can see a couple of files related to this, that seem small and reasonable

[root@hpe-apollo-cn99xx-15-vm-16 163417670]# pwd
/var/tmp/restraintd/logs/163417670
[root@hpe-apollo-cn99xx-15-vm-16 163417670]# ls -lh
total 76K
-rw-r--r--. 1 root root 9.9K Jul 24 01:49 harness.log
-rw-r--r--. 1 root root  64K Jul 24 00:46 task.log

In the restraintd log I don't see anything in particular - restraintd.log

Finally, if I try this manually, it just hangs (although TBH not sure if this should work outside the framework...)

# RECIPE_URL=http://localhost:8081/recipes/14282847 TASKID=163417670 /usr/bin/rstrnt-report-result --rhts stress-scheduler-class PASS /tmp/tmp.6PZqM1CmWv 0
** stress-scheduler-class PASS Score:0
^C
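To make the manual reproduction above repeatable without needing ^C, the command can be run under a deadline. This is only a sketch: `run_with_deadline` is a hypothetical wrapper, the 30 second value is an arbitrary assumption, and it relies on coreutils `timeout` returning exit status 124 when the deadline expires.

```shell
# run_with_deadline is a hypothetical wrapper, not part of restraint.
# It runs a command under coreutils `timeout` and reports on stderr
# when the command had to be stopped at the deadline.
run_with_deadline() {
    seconds=$1
    shift
    status=0
    timeout "$seconds" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        # 124 is the exit status `timeout` uses when the deadline expired.
        echo "timed out after ${seconds}s: $*" >&2
    fi
    return "$status"
}

# Usage with the command from above (30 seconds is an arbitrary choice):
# export RECIPE_URL=http://localhost:8081/recipes/14282847 TASKID=163417670
# run_with_deadline 30 /usr/bin/rstrnt-report-result --rhts stress-scheduler-class PASS /tmp/tmp.6PZqM1CmWv 0
```

An exit status of 124 would confirm the hang reproduces outside the test harness; any other status means the command returned on its own.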

Any ideas, or further debugging help for how this got into this situation welcome!

ianw commented Jul 24, 2023

P.S. The two jobs where this occurred and triggered the timeout are:

https://beaker.engineering.redhat.com/jobs/8093072
https://beaker.engineering.redhat.com/jobs/8100959

both aarch64 hosts

ianw commented Jul 25, 2023

On closer inspection, the two tests where this failed were run on systems with >=4 CPUs, which triggered [1] and an rhts-reboot. The whole test run passed on a system with 3 CPUs; the main difference is that there the test exited early at [1] and did not run the reboot. So it looks like the restart of restraintd during the reboot is related to this hang.

[1] https://gitlab.com/redhat/centos-stream/tests/kernel/kernel-tests/-/blob/main/general/process/reg-suit/testcase/bz1157802.sh?ref_type=heads#L211
