[SYCL] Lit hang when executing program that does not exist #16351

Open
ayylol opened this issue Dec 12, 2024 · 0 comments
Labels
bug Something isn't working confirmed

ayylol commented Dec 12, 2024

Describe the bug

A lit RUN line that tries to execute a program that does not exist causes lit itself to hang. This only occurs with tests inside the sycl/test-e2e and sycl/test folders, and only when using one of our containers.

To reproduce

  1. Include a code snippet that is as short as possible
    Add the following file (test.cpp in the command below) to either sycl/test-e2e or sycl/test:
    // RUN: non_existent
  2. Specify the command which should be used to compile the program
  3. Specify the command which should be used to launch the program
    llvm-lit ./test.cpp
  4. Indicate what is wrong and what was expected
    The test itself should fail immediately, and lit should then report the test statistics and exit. Instead, lit hangs until either an interrupt signal is sent or a timeout of 10 minutes is reached.
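
When reproducing, waiting out the full 10 minutes is tedious. A small helper along the following lines can flag the hang sooner. This is only a sketch: `finishes_within` is a hypothetical name and the deadline is whatever you choose. It relies on nothing beyond the standard `subprocess` module.

```python
import subprocess


def finishes_within(command, seconds):
    """Return True if `command` exits within `seconds`, False if it had to be killed."""
    try:
        # Output is deliberately not captured: the child writes straight to the
        # terminal, so there are no pipes for lingering worker processes to hold open.
        # subprocess.run kills the child and raises TimeoutExpired at the deadline.
        subprocess.run(command, timeout=seconds, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True
```

A failing test still counts as "finished" here, since only the exit time matters and not the exit code. In an affected container, a call such as `finishes_within(["llvm-lit", "./test.cpp"], 60)` would be expected to return False, assuming 60 seconds is comfortably longer than a normal run.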

Environment

  • OS: Linux Container
    • ubuntu2204_intel_drivers:alldeps (confirmed locally)
    • ubuntu2204_intel_drivers:latest (observed on CI)
      NOTE: Does not reproduce outside of a container, nor in some other containers.
  • DPC++ version: 28e8416

Additional context

The hang occurs at the call to the multiprocessing.Pool.join method inside the lit implementation.

try:
    self._wait_for(async_results, deadline)
except:
    pool.terminate()
    raise
finally:
    pool.join()
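
For readers unfamiliar with this shape, here is a self-contained toy version of the same terminate-then-join pattern. It is not lit's code: `wait_for` and `run_jobs` are hypothetical stand-ins, the `fork` start method is assumed because the report concerns Linux, and the sketch does not reproduce the hang. It only shows where `pool.join()` sits and how a worker exception reaches the parent.

```python
import multiprocessing
import time


def wait_for(async_results, deadline):
    """Collect results until the deadline, re-raising any worker exception."""
    values = []
    for result in async_results:
        remaining = max(0.0, deadline - time.time())
        # get() re-raises in the parent whatever the worker raised.
        values.append(result.get(timeout=remaining))
    return values


def run_jobs(jobs, timeout=30.0):
    """Run (function, args) pairs in a pool using the terminate/join shape above."""
    ctx = multiprocessing.get_context("fork")  # assumption: Linux host
    pool = ctx.Pool(processes=2)
    deadline = time.time() + timeout
    try:
        async_results = [pool.apply_async(fn, args) for fn, args in jobs]
        pool.close()  # no more work will be submitted
        return wait_for(async_results, deadline)
    except BaseException:
        pool.terminate()  # stop the workers immediately on any failure
        raise
    finally:
        pool.join()  # blocks until every worker process has exited
```

With builtin functions as jobs, `run_jobs([(divmod, (7, 2))])` returns `[(3, 1)]`, while `run_jobs([(divmod, (1, 0))])` re-raises `ZeroDivisionError` in the parent after the pool has been terminated and joined. The hang described in this issue means the final `join()` never returns in the affected containers.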

This seemingly happens when a function in one of the worker processes raises an exception that is not caught by the function that directly calls it. In our case the exception is raised in the _executeShCmd function:
if not executable:
    raise InternalShellError(j, "%r: command not found" % args[0])

The executeShCmd function calls _executeShCmd but does not catch this exception. Instead, it is caught one level further up, in executeScriptInternal:
try:
    shenv = ShellEnvironment(cwd, test.config.environment)
    exitCode, timeoutInfo = executeShCmd(
        cmd, shenv, results, timeout=litConfig.maxIndividualTestTime
    )
except InternalShellError:
    e = sys.exc_info()[1]
    exitCode = 127
    results.append(ShellCommandResult(e.command, "", e.message, exitCode, False))

Adding a try/except to executeShCmd, the direct caller, circumvents this hang.
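
The issue does not show the exact change, so the following is only a toy model of what "catching in the direct caller" could look like. The names mirror lit's, but every function here is a simplified stand-in with an invented signature, and `known_programs` is a hypothetical parameter replacing lit's real executable lookup.

```python
class InternalShellError(Exception):
    """Toy stand-in for lit's exception of the same name."""

    def __init__(self, command, message):
        super().__init__(message)
        self.command = command
        self.message = message


def _execute_sh_cmd(args, known_programs):
    """Toy inner function: raises when the program cannot be found."""
    if args[0] not in known_programs:
        raise InternalShellError(args, "%r: command not found" % args[0])
    return 0


def execute_sh_cmd(args, known_programs):
    """Toy direct caller: catches the error itself and reports exit code 127."""
    try:
        return _execute_sh_cmd(args, known_programs), ""
    except InternalShellError as e:
        # Convert the exception into an ordinary result so that it never
        # propagates past the function that called the raising function.
        return 127, e.message
```

In this model, `execute_sh_cmd(["non_existent"], {"ls"})` returns `(127, "'non_existent': command not found")` instead of raising. Why handling the exception one frame earlier avoids the hang is not established by this sketch.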

However, it is unclear whether this is actually an issue in upstream LLVM: the hang does not reproduce in either the clang/test or llvm/test folders (to make the comparison, useExternalSh must be set to false in the call to lit.TestRunner._runShTest), and it only reproduces in our containers.

Probably related: https://stackoverflow.com/questions/15314189/python-multiprocessing-pool-hangs-at-join

Setting useExternalSh to true in the call to lit.TestRunner._runShTest also works as a workaround, and this is what #16321 does to avoid the hang. However, it makes the test stdout less readable: all stdout is printed in one block rather than separated by RUN: lines.
