[BUG]: libcu++ barrier*.pass tests timeout on t4, rtx2080, v100 *only* when compiled for the target arch. #3579

Open
1 task done
alliepiper opened this issue Jan 29, 2025 · 0 comments
Labels
bug Something isn't working right.
alliepiper commented Jan 29, 2025

Is this a duplicate?

Type of Bug

Runtime Error

Component

libcu++

Describe the bug

During the migration of CI to nvks runners, two libcu++ tests fail with timeouts: heterogeneous/barrier_abi_v2.pass.cpp and heterogeneous/barrier.pass.cpp.

Note that this only happens when the tests are compiled specifically for the GPU's own arch. The tests pass on rtx2080, t4, and v100 when compiled for 60;70;80, but time out when compiled only for 70 (v100) or only for 75 (t4, rtx2080).
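To make the failing condition concrete, here is a minimal Python sketch that encodes the pattern reported above. The names `GPU_SM`, `parse_arch_list`, and `is_reported_failing_config` are hypothetical helpers written for illustration and are not part of CCCL. The sketch restates the observations in this report and does not explain the cause.

```python
# Arch values as stated in this report: 70 for v100, 75 for t4 and rtx2080.
GPU_SM = {"v100": 70, "t4": 75, "rtx2080": 75}


def parse_arch_list(arch_list):
    """Split a semicolon-separated arch list such as '60;70;80' into integers."""
    return [int(arch) for arch in arch_list.split(";") if arch.strip()]


def is_reported_failing_config(gpu, arch_list):
    """Return True if the build targets only the GPU's own arch.

    This mirrors the observation in the report: the barrier tests time out
    when compiled only for 70 (v100) or only for 75 (t4, rtx2080), and pass
    when compiled for 60;70;80.
    """
    return parse_arch_list(arch_list) == [GPU_SM[gpu]]
```

For example, `is_reported_failing_config("t4", "75")` is true, while `is_reported_failing_config("t4", "60;70;80")` is false.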

Failure output from the CI log at https://github.com/NVIDIA/cccl/actions/runs/13022094204/job/36324973191:

  1: UNRESOLVED: libcu++ :: heterogeneous/barrier.pass.cpp (362 of 2341)
  1: ******************** TEST 'libcu++ :: heterogeneous/barrier.pass.cpp' FAILED ********************
  1: Exception during script execution:
  1: Traceback (most recent call last):
  1:   File "/home/coder/.local/share/venvs/cccl/lib/python3.10/site-packages/lit/worker.py", line 76, in _execute_test_handle_errors
  1:     result = test.config.test_format.execute(test, lit_config)
  1:   File "/home/coder/cccl/libcudacxx/test/utils/libcudacxx/test/format.py", line 83, in execute
  1:     return self._execute(test, lit_config)
  1:   File "/home/coder/cccl/libcudacxx/test/utils/libcudacxx/test/format.py", line 158, in _execute
  1:     return self._evaluate_pass_test(
  1:   File "/home/coder/cccl/libcudacxx/test/utils/libcudacxx/test/format.py", line 208, in _evaluate_pass_test
  1:     cmd, out, err, rc = self.executor.run(
  1:   File "/home/coder/cccl/libcudacxx/test/utils/libcudacxx/test/executor.py", line 43, in run
  1:     out, err, rc = executeCommand(cmd, cwd=work_dir, env=env, timeout=self.timeout)
  1:   File "/home/coder/cccl/libcudacxx/test/utils/libcudacxx/util.py", line 242, in executeCommand
  1:     raise ExecuteCommandTimeoutException(
  1: libcudacxx.util.ExecuteCommandTimeoutException
  1: 
  1: 
  1: ********************
  1: UNRESOLVED: libcu++ :: heterogeneous/barrier_abi_v2.pass.cpp (364 of 2341)
  1: ******************** TEST 'libcu++ :: heterogeneous/barrier_abi_v2.pass.cpp' FAILED ********************
  1: Exception during script execution:
  1: Traceback (most recent call last):
  1:   File "/home/coder/.local/share/venvs/cccl/lib/python3.10/site-packages/lit/worker.py", line 76, in _execute_test_handle_errors
  1:     result = test.config.test_format.execute(test, lit_config)
  1:   File "/home/coder/cccl/libcudacxx/test/utils/libcudacxx/test/format.py", line 83, in execute
  1:     return self._execute(test, lit_config)
  1:   File "/home/coder/cccl/libcudacxx/test/utils/libcudacxx/test/format.py", line 158, in _execute
  1:     return self._evaluate_pass_test(
  1:   File "/home/coder/cccl/libcudacxx/test/utils/libcudacxx/test/format.py", line 208, in _evaluate_pass_test
  1:     cmd, out, err, rc = self.executor.run(
  1:   File "/home/coder/cccl/libcudacxx/test/utils/libcudacxx/test/executor.py", line 43, in run
  1:     out, err, rc = executeCommand(cmd, cwd=work_dir, env=env, timeout=self.timeout)
  1:   File "/home/coder/cccl/libcudacxx/test/utils/libcudacxx/util.py", line 242, in executeCommand
  1:     raise ExecuteCommandTimeoutException(
  1: libcudacxx.util.ExecuteCommandTimeoutException
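Both tracebacks end in the harness's timeout path, so lit reports the tests as UNRESOLVED: the test binaries never exit within the time limit, rather than exiting with a failing status. The sketch below shows that general pattern using only the Python standard library. It is not the implementation of `executeCommand` in libcudacxx/util.py; `run_with_timeout` and the stand-in commands are hypothetical names used for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returncode, timed_out).

    subprocess.run kills the child process and raises TimeoutExpired if it
    has not exited within timeout_s seconds.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A hang is reported separately from a nonzero exit status.
        return None, True
    return proc.returncode, False


# Stand-ins for a test binary: one that never finishes in time, one that exits cleanly.
HANGING_CMD = [sys.executable, "-c", "import time; time.sleep(30)"]
QUICK_CMD = [sys.executable, "-c", "print('ok')"]
```

To check whether a locally built test binary hangs, the same helper can be pointed at the binary's path instead of the stand-in commands.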

How to Reproduce

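Run the libcu++ tests on a t4 or rtx2080, compiled only for the GPU's own arch (75). Possibly any sm75 system reproduces this? The same timeouts occur on a v100 when the tests are compiled only for 70.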

To do this through CI, edit ci/matrix.yaml and add the following lines to the override section:

  override:
    - {jobs: ['test'],  project: ['libcudacxx'], std: 'max', cxx: ['gcc', 'clang'], gpu: 't4', sm: 'gpu'}
    - {jobs: ['test'],  project: ['libcudacxx'], std: 'max', cxx: ['gcc', 'clang'], gpu: 'rtx2080', sm: 'gpu'}

This restricts CI to the relevant jobs, which makes it quick to verify that the bug is fixed.

Expected behavior

The tests should pass regardless of hardware and regardless of which arch they are compiled for.

@alliepiper alliepiper added the bug Something isn't working right. label Jan 29, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jan 29, 2025
@alliepiper alliepiper changed the title [BUG]: libcu++ barrier*.pass tests timeout on t4, rtx2080 [BUG]: libcu++ barrier*.pass tests timeout on t4, rtx2080, v100 *only* when compiled for the target arch. Jan 29, 2025