Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition with rstrnt-reboot giving control back to the runtest script #219

Open
veruu opened this issue May 19, 2021 · 5 comments
Open
Assignees

Comments

@veruu
Copy link

veruu commented May 19, 2021

We hit a race condition with rstrnt-reboot where the restraint process got terminated by systemd and it returned execution back to the test script. This happened when running restraint in standalone mode.

Specific case where we hit this problem: We have a script that installs a kernel, reboots the machine and checks the correct new kernel booted.

You can imagine the short version of the script as (pseudocode)

if REBOOTCOUNT == 0; then
    install_kernel
    rstrnt-reboot
else
    verify_kernel_version
    rstrnt-report-result testname PASS 0
fi

(Full version is available at https://gitlab.com/cki-project/kernel-tests/-/tree/main/distribution/kpkginstall)
After rstrnt-reboot was called, our logs contained an unexpected line

./runtest.sh: line 476:  3842 Terminated              rstrnt-reboot

The generated index file by restraint had a new "exit_code" FAIL result.

In the console log, we see messages from systemd sending SIGTERM to running tasks:

[    
  OK     
] Reached target Shutdown.  
[  198.436632] systemd-shutdown[1]: Sending SIGTERM to remaining processes... 
[  198.460924] systemd-journald[669]: Received SIGTERM from PID 1 (systemd-shutdow). 
[  198.470495] systemd-shutdown[1]: Sending SIGKILL to remaining processes... 

The timing of all these logs/results matches up. The rstrnt-reboot command terminated and returned the control to the test script, which then exited with the nonzero retcode of rstrnt-reboot (since that's the last command it ran). The script exiting with the retcode of the last executed command is expected (as there is no result reporting in that execution branch), rstrnt-reboot terminating and returning the control before the actual reboot happens is not. This is confirmed by looking into the code for rstrnt-reboot which explicitly states this should not happen: https://github.com/beaker-project/restraint/blob/master/scripts/rstrnt-reboot#L13

@wackrat
Copy link

wackrat commented May 27, 2021

A trap statement for SIGTERM in rstrnt-reboot might improve the odds.

If rstrnt-reboot ignores SIGTERM, then by the time the rstrnt-reboot process receives SIGKILL other processes (like kpkginstall/runtest.sh) should already have received SIGTERM.

@cbouchar cbouchar self-assigned this Jun 22, 2021
@cbouchar
Copy link
Contributor

@veruu @wackrat I've written code to include a trap for SIGTERM but I agree with wackrat that it's not a full-proof solution.
Is it possible for your side to exit with non-zero cause I'm not sure it is a good idea to ignore the SIGKILL. I can do that too if you like but I'm thinking perhaps something can also be done on your end.
I have code written if you like to try it out. What image would you need?

@veruu
Copy link
Author

veruu commented Jun 25, 2021

@cbouchar we added that as a workaround to all our tests when we ran into this problem. However, other people may run into the same issue who don't have it in place for all of the tests they are running so it may still be a good idea to have it fixed on restraint side, or at least document this need very well and make it obvious

@wackrat
Copy link

wackrat commented Jun 25, 2021

I'm not sure it is a good idea to ignore the SIGKILL.

SIGKILL can't be ignored.

As long as callers of rstrnt-reboot exit cleanly upon SIGTERM (which they presumably do if they don't trap SIGTERM), by the time SIGKILL reaches rstrnt-reboot I'd expect its callers to have exited already. I haven't tested any of this myself, though.

It is possible to assign a command with trap. If a caller of rstrnt-reboot does that, it can include an explicit exit after any appropriate cleanup.

@cbouchar
Copy link
Contributor

@wackrat
I suspected SIGKILL couldn't be ignored. Thanks for the confirmation.
I chose the following trap method:
trap "" SIGTERM
This method prevents interrupting rstrnt-reboot. It is discussed in the following reference.
Reference: https://www.linuxjournal.com/content/bash-trap-command

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants