[Tracking] @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so failed to build after updating XLA pin #8199

Closed
@qihqi

🐛 Bug

After updating the XLA pin from 32ebd694c4d0442e241d76324ff1a721831366b4 to 590cd6fcd1ed24ab9cf494789a0fc524b94a4a6a in PR https://github.com/pytorch/xla/pull/8079/files, our CI fails:
https://github.com/pytorch/xla/actions/runs/11060810258/job/30732124138?pr=8079

The target that fails to build is @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so (that is, bazel build @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so), which is not one of our own targets.
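Since the failing target is not one of ours, it may help to ask Bazel which of our targets pulls it in. The sketch below is only a suggestion: the helper name why_is_it_built is hypothetical, and using //... as the starting set assumes that pattern covers the targets CI builds. somepath and --keep_going are standard bazel query features, but the query may still fail while the missing-target error is unresolved.

```shell
# Hypothetical helper (not part of the repo): ask Bazel for one dependency
# path from a set of our targets to the failing external target.
# --keep_going lets the query continue past packages that fail to load.
why_is_it_built() {
  from="$1"   # starting set, e.g. //... (an assumption about what CI builds)
  to="$2"     # the target that CI reports as failing
  bazel query --keep_going "somepath(${from}, ${to})"
}
```

Example invocation: why_is_it_built '//...' '@xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so'. Any path it prints would show the chain of dependencies leading from our workspace to the plugin.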

The exact error is:

ERROR: /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/xla/xla/stream_executor/cuda/BUILD:450:19: no such target '@local_config_cuda//cuda:implicit_cuda_headers_dependency': target 'implicit_cuda_headers_dependency' not declared in package 'cuda' defined by /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/local_config_cuda/cuda/BUILD (Tip: use query "@local_config_cuda//cuda:*" to see all the targets in that package) and referenced by '@xla//xla/stream_executor/cuda:delay_kernel_cuda_cuda'
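The tip in the error message can be followed directly to see what the generated @local_config_cuda//cuda package actually declares. A minimal sketch, assuming the query is run from the workspace root; the helper names list_cuda_targets and check_target are hypothetical, and the exact label format printed by bazel query may differ between Bazel versions.

```shell
# Package named in the error message.
CUDA_PKG='@local_config_cuda//cuda'

# List every target the generated package declares
# (this is the query suggested by Bazel's own tip).
list_cuda_targets() {
  bazel query "${CUDA_PKG}:*"
}

# Hypothetical helper: print "present" or "missing" for one target name,
# based on an exact match against the query output.
check_target() {
  name="$1"
  if list_cuda_targets 2>/dev/null | grep -qxF "${CUDA_PKG}:${name}"; then
    echo "present"
  else
    echo "missing"
  fi
}
```

Running check_target implicit_cuda_headers_dependency before and after switching the cuda_configure setup would show whether the new setup generates the target that XLA references.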

@local_config_cuda is defined using the cuda_configure Starlark function from upstream TSL (https://github.com/google/tsl), like this:

load(
   "@tsl//third_party/gpus/cuda/hermetic:cuda_configure.bzl",
   "cuda_configure",
)

cuda_configure(name = "local_config_cuda")

This code was copied from the deprecated section of this doc: https://github.com/openxla/xla/blob/main/docs/hermetic_cuda.md#deprecated-non-hermetic-cudacudnn-usage

Current theory:

The cuda_configure function is supposed to set up local_config_cuda so that it has the build targets that tsl needs, but this deprecated non-hermetic version does not do that.

Actions tried so far:

We tried to follow the hermetic cuda setup described in this doc: https://github.com/openxla/xla/blob/main/docs/hermetic_cuda.md#deprecated-non-hermetic-cudacudnn-usage

However, it requires the clang compiler instead of gcc.

I am attempting to use clang, but this line, which forces gcc, claims that clang has issues:

xla/.bazelrc, line 27 in 940bee4:

# Force GCC because clang/bazel has issues.

With clang, it produces this error:

      ERROR: /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/llvm-project/llvm/BUILD.bazel:251:11: Compiling llvm/lib/Support/Valgrind.cpp [for tool] failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Support':
      this rule is missing dependency declarations for the following files included by 'llvm/lib/Support/Valgrind.cpp':
        '/usr/lib/clang/11.0.1/include/stddef.h'
        '/usr/lib/clang/11.0.1/include/__stddef_max_align_t.h'

This is odd, because stddef.h is a system header and Bazel should not require an extra BUILD dependency declaration for it.

This post on Stack Overflow says that we should clean the Bazel cache. We did that by adding bazel clean --expunge right before the build, and it still does not work.
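One thing that could be checked here: Bazel's "undeclared inclusion(s)" check accepts headers that live under the C++ toolchain's declared built-in include directories, so a mismatch between those directories and the paths clang really uses is a possible cause (a hypothesis, not confirmed in this issue). The sketch below prints the include search list clang reports, so it can be compared against the paths in the error. The helper name builtin_include_dirs is hypothetical; the clang flags are standard.

```shell
# Hypothetical helper: print the include search directories the compiler
# reports, one per line. The compiler binary defaults to "clang".
builtin_include_dirs() {
  # -v prints the search list on stderr; -E with empty stdin avoids compiling.
  "${1:-clang}" -E -x c++ - -v </dev/null 2>&1 |
    sed -n '/^#include <...> search starts here:/,/^End of search list\./p' |
    sed '1d;$d' |   # drop the two marker lines around the list
    sed 's/^ *//'   # strip the leading indentation
}
```

If /usr/lib/clang/11.0.1/include shows up here but under a different (for example symlink-resolved) path than the one Bazel recorded for the toolchain, that difference would be worth investigating.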

The latest CI with the above change is: https://github.com/pytorch/xla/actions/runs/11115985671/job/30885415097?pr=8079
