pjrt_c_api_cpu_plugin - MacOS amd and arm64 compilation #19152

Open
dmarro89 opened this issue Nov 7, 2024 · 4 comments

Comments

@dmarro89

dmarro89 commented Nov 7, 2024

Hi,
I'm trying to compile the pjrt_c_api_cpu_plugin.so plugin on macOS (x86_64 and arm64).
The build succeeds, but when I try to use it in gopjrt I get a segmentation violation error.

What I've done:
In the XLA project:

  • ./configure.py --backend CPU --os DARWIN
  • bazel build //xla/pjrt/c:pjrt_c_api_cpu_plugin.so

Then I ran the gopjrt tests, and they fail with that error.

If I use the JAX Metal plugin (pjrt_plugin_metal_14.dylib) instead, the project works fine.

Has anyone already compiled and used the plugin on macOS?

Thanks

@janpfeifer
Contributor

The same error occurs with bazel test //xla/pjrt:pjrt_c_api_client_test: it trips on a complex-number MatMul that is not defined in Eigen.

@janpfeifer
Contributor

janpfeifer commented Nov 17, 2024

Aside from these broken tests, pjrt_c_api_cpu_plugin.so can be built on Mac (both x86_64 and arm64). However, compilation of even very trivial HLO programs fails on Mac.

(XLA at commit 9ab7d704d7fe7e73fc3976adc2ccec070bc9a2ea)

To factor out the Go bindings (the gopjrt project) that we were using to test, I created the following additional mac_test.cc file under xla/pjrt. Instead of linking PJRT into the test (which fails on the complex-number MatMul described above), it reproduces what a PJRT client would do, which is to use dlopen on the plugin.
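For reference, the dlopen step can also be checked in isolation, separately from the XLA test described here. The following is a minimal standalone sketch, not part of mac_test.cc: the helper name PluginExportsSymbol is hypothetical, and it uses only the POSIX dlopen/dlsym calls. The symbol name GetPjrtApi is taken from the log output further down ("GetPjrtApi was found for cpu at ...").

```cpp
#include <dlfcn.h>

#include <string>

// Hypothetical helper (not an XLA API): returns true if the shared library at
// `path` can be opened with dlopen and exports a symbol named `symbol`.
// On failure, `error` receives the dlerror() text when one is available.
bool PluginExportsSymbol(const std::string& path, const std::string& symbol,
                         std::string* error) {
  void* handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    const char* msg = dlerror();
    *error = (msg != nullptr) ? msg : "unknown dlopen error";
    return false;
  }

  dlerror();  // Clear any stale error state before calling dlsym.
  void* sym = dlsym(handle, symbol.c_str());
  const char* msg = dlerror();
  const bool found = (sym != nullptr && msg == nullptr);
  if (!found) {
    *error = (msg != nullptr) ? msg : "symbol resolved to null";
  }

  dlclose(handle);
  return found;
}
```

Called with the plugin path and "GetPjrtApi", a true result only shows that the library loads and exports its entry point. It says nothing about the HLO compilation step, which happens later.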

The C++ test compiles, runs, and fails without any error message on Mac, but succeeds as expected on Linux.

Here is the code:

  • Same includes as //xla/pjrt/c:pjrt_c_api_cpu_test, except removed the #include "xla/pjrt/c/pjrt_c_api_cpu_internal.h" -- to avoid linking in the pjrt and the missing complex MatMul.
  • Same BUILD dependencies as //xla/pjrt/c:pjrt_c_api_cpu_test, except "//xla/pjrt/c:pjrt_c_api_cpu_internal" which was commented out.
namespace xla {
namespace {

using std::cout;
using std::endl;

const std::string kPjrtPluginPath = "/usr/local/lib/gomlx/pjrt/pjrt_c_api_cpu_plugin.so";

static void SetUpCpuPjRtApi() {
  std::string device_type = "cpu";
  auto status = ::pjrt::PjrtApi(device_type);
  if (!status.ok()) {
    TF_ASSERT_OK_AND_ASSIGN(
      const PJRT_Api* api,
      pjrt::LoadPjrtPlugin(device_type, kPjrtPluginPath)
    );
    cout << "Loaded PJRT from " << kPjrtPluginPath << endl;
  }
}

TEST(PjRtCApiClientTest, EndToEnd) {
  SetUpCpuPjRtApi();
  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<PjRtClient> client,
                          GetCApiClient("cpu"));
  cout << "\tplatform_name=" << client->platform_name()
    << ", platform_id=" << client->platform_id()
    << endl;

  // Create f(x) = x^2
  cout << "Create f(x) = x^2:" << endl;
  XlaBuilder builder("x*x");
  Shape x_shape = ShapeUtil::MakeShape(F32, {});
  auto x = Parameter(&builder, 0, x_shape, "x");
  auto f = Mul(x, x);
  auto computation = builder.Build(f).value();
  cout << "\tComputation built." << endl;
  std::unique_ptr<PjRtLoadedExecutable> executable =
      client->Compile(computation, CompileOptions()).value();
  cout << "\tCompiled to executable." << endl;

  // Transfer concrete x value.
  std::vector<float> data{3};
  TF_ASSERT_OK_AND_ASSIGN(
      auto x_value,
      client->BufferFromHostBuffer(
          data.data(), x_shape.element_type(), x_shape.dimensions(),
          /*byte_strides=*/std::nullopt,
          PjRtClient::HostBufferSemantics::kImmutableOnlyDuringCall, nullptr,
          client->addressable_devices()[0]));
  cout << "Transferred value of x=" << data[0] << " to device." << endl;

  // Execute function.
  ExecuteOptions execute_options;
  execute_options.non_donatable_input_indices = {0};
  std::vector<std::vector<std::unique_ptr<PjRtBuffer>>> results =
      executable->Execute({{x_value.get()}}, execute_options)
          .value();
  ASSERT_EQ(results[0].size(), 1);
  auto& result_buffer = results[0][0];
  // How do we print the actual result !?
  cout << "Success" << endl;
}

}  // namespace
}  // namespace xla

When running with bazel test --test_output=streamed //xla/pjrt:mac_test, I get:

...
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from PjRtCApiClientTest
[ RUN      ] PjRtCApiClientTest.EndToEnd
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1731834725.730697  488487 pjrt_api.cc:99] GetPjrtApi was found for cpu at /usr/local/lib/gomlx/pjrt/pjrt_c_api_cpu_plugin.so
I0000 00:00:1731834725.730723  488487 pjrt_api.cc:78] PJRT_Api is set for device type cpu
Loaded PJRT from /usr/local/lib/gomlx/pjrt/pjrt_c_api_cpu_plugin.so
2024-11-17 09:12:05.731356: I xla/pjrt/pjrt_c_api_client.cc:127] PjRtCApiClient created.
        platform_name=cpu, platform_id=17910157233389077760
Create f(x) = x^2:
        Computation built.
2024-11-17 09:12:05.736506: I xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 13580721266388441979
FAIL: //xla/pjrt:mac_test (see /private/var/tmp/_bazel_m1/7c4fef9842af5ebc1d7b23e8ea13a23d/execroot/xla/bazel-out/darwin_arm64-opt/testlogs/xla/pjrt/mac_test/test.log)
Target //xla/pjrt:mac_test up-to-date:
  bazel-bin/xla/pjrt/mac_test
...

Yes, it fails without any error message, before it finishes compiling the HLO program.
The contents of the test.log are the same.

@janpfeifer
Contributor

I'm trying to debug it. Building with -c dbg and running the binary directly gives more hints:

$ bazel build -c dbg //xla/pjrt:mac_test
... (build output)...
$ ./bazel-bin/xla/pjrt/mac_test
Running main() from gmock_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from PjRtCApiClientTest
[ RUN      ] PjRtCApiClientTest.EndToEnd
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1731836397.519141  525455 pjrt_api.cc:99] GetPjrtApi was found for cpu at /usr/local/lib/gomlx/pjrt/pjrt_c_api_cpu_plugin.so
I0000 00:00:1731836397.519471  525455 pjrt_api.cc:78] PJRT_Api is set for device type cpu
Loaded PJRT from /usr/local/lib/gomlx/pjrt/pjrt_c_api_cpu_plugin.so
2024-11-17 10:39:57.523582: I xla/pjrt/pjrt_c_api_client.cc:127] PjRtCApiClient created.
        platform_name=cpu, platform_id=17910157233389077760
Create f(x) = x^2:
        Computation built.
2024-11-17 10:39:57.618260: I xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 4980911022492290414
mac_test(37825,0x1e9c9fac0) malloc: *** error for object 0x400000000: pointer being realloc'd was not allocated
mac_test(37825,0x1e9c9fac0) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

Again, note that this only happens on the Darwin platform.

@janpfeifer
Contributor

Also, if one changes xla/pjrt/pjrt_c_api_client_test.cc to load the plugin (as opposed to linking it in), it fails in the same way on Mac. To do that, just replace a couple of lines in this function:

static void SetUpCpuPjRtApi() {
  std::string device_type = "cpu";
  auto status = ::pjrt::PjrtApi(device_type);
  if (!status.ok()) {
    TF_ASSERT_OK(
        pjrt::SetPjrtApi(device_type, ::pjrt::cpu_plugin::GetCpuPjrtApi()));
  }
}

with the following (adjust kPjrtPluginPath to the path of the PJRT plugin):

const std::string kPjrtPluginPath = "/usr/local/lib/gomlx/pjrt/pjrt_c_api_cpu_plugin.so";
static void SetUpCpuPjRtApi() {
  std::string device_type = "cpu";
  auto status = ::pjrt::PjrtApi(device_type);
  if (!status.ok()) {
    TF_ASSERT_OK_AND_ASSIGN(
      const PJRT_Api* api,
      pjrt::LoadPjrtPlugin(device_type, kPjrtPluginPath));
  }
}
