Support async collective op execution #1287
Conversation
@@ -529,6 +530,14 @@ LogicalResult LowerHLOToLLVM(ModuleOp m, const DISCLoweringOptions& options) {
  pm.addNestedPass<FuncOp>(createCanonicalizerPass());
  pm.addNestedPass<FuncOp>(createCSEPass());
  pm.addNestedPass<FuncOp>(createCanonicalizerPass());

  bool enable_op_schedule = false;
  tensorflow::ReadBoolFromEnvVar("DISC_ENABLE_OP_SCHEDULE", enable_op_schedule,
Does this pass ever cause a negative optimization effect? If not, let's enable it by default; we already have too many flags...
OK, we haven't found a case where the schedule pass causes a negative effect.
Fixed.
@@ -55,6 +55,24 @@ std::optional<std::string> ReductionKindToString(ReductionKind kind) {
  return std::nullopt;
}

bool EnableAsyncCollective(Operation* op) {
  if (llvm::isa<mhlo::AllReduceOp>(op)) {
    if (const char* env_p = std::getenv("ENABLE_ASYNC_ALL_REDUCE")) {
Does this make all collective ops async? Do we need to separate sync and async collective ops?
Fixed.
class Edge;

struct GraphNode {
There is a latency-hiding scheduler in the TensorFlow source tree; can we reuse its base class implementation?
ref: https://github.com/pai-disc/tensorflow/blob/features/bladedisc_rebase_20230202/tensorflow/compiler/xla/service/latency_hiding_scheduler.h
LGTM with a tiny comment; it does not block the merge.
#endif

struct BaseCudaContextOption {
  ncclComm_t nccl_comm = nullptr;
  GpuStreamHandle stream = nullptr;
  GpuStreamHandle comm_stream = nullptr;
Should this be comm_stream or compute_stream?
The original stream is the compute stream; we add this new comm_stream for communication.
}
}

void InitilizeGrpahTopology(std::vector<Operation*>& post_order_instructions,
typo: Graph
OK, will fix it in the next PR.
for (auto& block : main_func.getBody()) {
  for (int op_idx = 0; op_idx < scheduled_op_sequence.size(); op_idx++) {
    scheduled_op_sequence[op_idx]->moveBefore(&block.front());
Does scheduled_op_sequence contain all ops before scheduling? Does the loop at L983 mean the scheduled op sequence replaces the old op sequence?
Yes, it contains all ops.