support gpu spmd #5

mars1248 · 2023-12-20T09:28:09Z

During GPU training, each GPU is driven by its own process, so a process can only see the device information for its own GPU when executing subgraphs. The underlying OpenXLA logic therefore needs to be modified to support local execution.
With the original logic, retrieving the result of the computation causes a core dump.
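
To illustrate the idea of local execution described above, here is a minimal sketch. It is not the code changed in this PR. `DeviceAssignment` and `LocalExecutionSlots` are hypothetical names introduced only for this example, and it assumes a device assignment can be modeled as a replica-by-partition table of global device ids. The function picks out the (replica, partition) slots whose device is addressable from the current process, so the process launches work only for those.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device assignment (not an XLA type):
// assignment[replica][partition] holds a global device id.
using DeviceAssignment = std::vector<std::vector<int>>;

// Returns the (replica, partition) pairs whose device is addressable
// from this process. With one process per GPU, this is typically a
// single slot out of the whole assignment.
std::vector<std::pair<int, int>> LocalExecutionSlots(
    const DeviceAssignment& assignment,
    const std::set<int>& addressable_device_ids) {
  std::vector<std::pair<int, int>> slots;
  for (std::size_t r = 0; r < assignment.size(); ++r) {
    for (std::size_t p = 0; p < assignment[r].size(); ++p) {
      // Skip devices owned by other processes.
      if (addressable_device_ids.count(assignment[r][p]) > 0) {
        slots.emplace_back(static_cast<int>(r), static_cast<int>(p));
      }
    }
  }
  return slots;
}
```

For example, with one replica split into four partitions on devices 0 to 3, a process that can only address device 2 would get back the single slot (replica 0, partition 2) and would execute and collect results for that slot only.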

wbmc · 2023-12-20T13:45:02Z

xla/pjrt/pjrt_stream_executor_client.cc

@@ -2881,6 +2881,7 @@ PjRtStreamExecutorLoadedExecutable::Execute(
    auto& statusor = results[i];
    if (!statusor.ok()) {
      if (returned_futures.has_value()) {
+        VLOG(0) << "returned_futures clear";


Should we remove this log, or use VLOG(3) since it is only for debugging?

wbmc · 2023-12-20T13:45:35Z

xla/pjrt/pjrt_stream_executor_client.cc

+        num_partitions());
+  }
+
+  VLOG(1) << "Executing computation " << name()


xla/pjrt/pjrt_stream_executor_client.cc

+          << " num_partitions=" << num_partitions()
+          << " num_addressable_devices=" << num_addressable_devices;
+  TF_ASSIGN_OR_RETURN(
+  auto result,


support gpu spmd

8c91659

github-actions bot added the kokoro:force-run label Dec 20, 2023

wbmc reviewed Dec 20, 2023


delete debug log

f888478

wbmc merged commit 9225332 into intelligent-machine-learning:main Dec 27, 2023
4 checks passed
