
[Tracking] @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so failed to build after updating XLA pin #8199


Closed
qihqi opened this issue Oct 1, 2024 · 4 comments
Labels
bug Something isn't working openxla xla:gpu

Comments


qihqi commented Oct 1, 2024

🐛 Bug

After updating the XLA pin from 32ebd694c4d0442e241d76324ff1a721831366b4 to 590cd6fcd1ed24ab9cf494789a0fc524b94a4a6a in PR https://github.com/pytorch/xla/pull/8079/files, our CI has the following failure:
https://github.com/pytorch/xla/actions/runs/11060810258/job/30732124138?pr=8079

The object that fails to build is bazel build @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so, which is not our target.

The exact error is:

ERROR: /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/xla/xla/stream_executor/cuda/BUILD:450:19: no such target '@local_config_cuda//cuda:implicit_cuda_headers_dependency': target 'implicit_cuda_headers_dependency' not declared in package 'cuda' defined by /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/local_config_cuda/cuda/BUILD (Tip: use query "@local_config_cuda//cuda:*" to see all the targets in that package) and referenced by '@xla//xla/stream_executor/cuda:delay_kernel_cuda_cuda'
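Following the tip in the error message itself, one way to check what the generated repository actually contains is a bazel query from the failing workspace (standard Bazel; the repository name comes from the error above):

# List every target declared in the generated @local_config_cuda//cuda package,
# to confirm that implicit_cuda_headers_dependency really is not there.
bazel query "@local_config_cuda//cuda:*"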

This @local_config_cuda is defined using upstream's (https://github.com/google/tsl) cuda_configure Starlark function, like this:

load(
    "@tsl//third_party/gpus/cuda/hermetic:cuda_configure.bzl",
    "cuda_configure",
)

cuda_configure(name = "local_config_cuda")

This bit of code was copied following the deprecated section of this doc: https://github.com/openxla/xla/blob/main/docs/hermetic_cuda.md#deprecated-non-hermetic-cudacudnn-usage

Current theory:

The cuda_configure function is supposed to set up local_config_cuda with the build targets that TSL's BUILD files reference, but this deprecated non-hermetic version does not declare all of them (in this case, implicit_cuda_headers_dependency).

Actions tried so far:

We tried to follow the hermetic CUDA setup described in this doc: https://github.com/openxla/xla/blob/main/docs/hermetic_cuda.md#deprecated-non-hermetic-cudacudnn-usage

However, it requires using the clang compiler instead of gcc.
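For reference, the WORKSPACE shape that the hermetic doc describes looks roughly like the sketch below. This is my reading of the doc, not our actual config; the exact .bzl paths and rule names come from the pinned @tsl and may have moved since.

load(
    "@tsl//third_party/gpus/cuda/hermetic:cuda_json_init_repository.bzl",
    "cuda_json_init_repository",
)

cuda_json_init_repository()

load(
    "@cuda_redist_json//:distributions.bzl",
    "CUDA_REDISTRIBUTIONS",
    "CUDNN_REDISTRIBUTIONS",
)
load(
    "@tsl//third_party/gpus/cuda/hermetic:cuda_redist_init_repositories.bzl",
    "cuda_redist_init_repositories",
    "cudnn_redist_init_repository",
)

# Download the CUDA/cuDNN redistributions pinned in the JSON manifest
# instead of reading them from the local machine.
cuda_redist_init_repositories(
    cuda_redistributions = CUDA_REDISTRIBUTIONS,
)

cudnn_redist_init_repository(
    cudnn_redistributions = CUDNN_REDISTRIBUTIONS,
)

load(
    "@tsl//third_party/gpus/cuda/hermetic:cuda_configure.bzl",
    "cuda_configure",
)

# Same repository name as the non-hermetic variant, but this version
# declares the full set of targets that XLA/TSL BUILD files reference.
cuda_configure(name = "local_config_cuda")

Per the doc, the CUDA/cuDNN versions are then pinned via --repo_env flags (HERMETIC_CUDA_VERSION / HERMETIC_CUDNN_VERSION) rather than detected from the host.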

I am attempting to use clang, but xla/.bazelrc (line 27 at commit 940bee4) forces gcc and claims that clang has issues:

# Force GCC because clang/bazel has issues.

With clang, it produces this error:

      ERROR: /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/llvm-project/llvm/BUILD.bazel:251:11: Compiling llvm/lib/Support/Valgrind.cpp [for tool] failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Support':
      this rule is missing dependency declarations for the following files included by 'llvm/lib/Support/Valgrind.cpp':
        '/usr/lib/clang/11.0.1/include/stddef.h'
        '/usr/lib/clang/11.0.1/include/__stddef_max_align_t.h'

This is odd, because stddef.h is a compiler-provided header and Bazel should not require an extra BUILD dependency declaration for it. My guess is that Bazel's autodetected C++ toolchain was generated against gcc, so clang's resource directory (/usr/lib/clang/11.0.1/include) is missing from the toolchain's cxx_builtin_include_directories, and Bazel therefore flags those headers as undeclared inputs.

A post on Stack Overflow says that we should clean the Bazel cache. We did so by adding bazel clean --expunge right before the build, and it still doesn't work.
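If the stale-toolchain guess above is right, cleaning alone would not fix this unless the toolchain is also re-detected against clang. A minimal sketch of such a CI step, assuming Bazel's autodetected C++ toolchain and a hypothetical clang path (this is not our actual config):

# Wipe the output base, including the cached auto-configured C++ toolchain.
bazel clean --expunge
# --repo_env=CC is a standard Bazel flag that the autodetected toolchain
# honors; /usr/bin/clang-11 is a placeholder path for illustration.
bazel build --repo_env=CC=/usr/bin/clang-11 @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so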

The latest CI with the above change is: https://github.com/pytorch/xla/actions/runs/11115985671/job/30885415097?pr=8079


baoleai commented Oct 8, 2024

Is the original issue with cuda_configure caused by the changes made in openxla/xla@29bd19c?


qihqi commented Oct 10, 2024

> Is the original issue with cuda_configure caused by the changes made in openxla/xla@29bd19c?

Maybe. How would one verify?

Regardless, as long as we are on the non-hermetic CUDA path of TSL, this is bound to happen again.

JackCaoG commented

I am going to disable the GPU build and test in #8286 for now to unblock the PIN update.

ysiraichi added the bug, xla:gpu, and openxla labels on Mar 6, 2025
ysiraichi commented

Closing this, since #8593 brought the CUDA CI back. The hermetic CUDA transition is still in the works (#8665).
