Support async collective op execution #1287
Conversation
@@ -529,6 +530,14 @@ LogicalResult LowerHLOToLLVM(ModuleOp m, const DISCLoweringOptions& options) {
  pm.addNestedPass<FuncOp>(createCanonicalizerPass());
  pm.addNestedPass<FuncOp>(createCSEPass());
  pm.addNestedPass<FuncOp>(createCanonicalizerPass());

  bool enable_op_schedule = false;
  tensorflow::ReadBoolFromEnvVar("DISC_ENABLE_OP_SCHEDULE", enable_op_schedule,
Does this pass ever cause a negative optimization effect? If not, let's enable it by default; we already have too many flags...
OK, we haven't found a case where the schedule pass causes a negative effect.
Fixed.
@@ -55,6 +55,24 @@ std::optional<std::string> ReductionKindToString(ReductionKind kind) {
  return std::nullopt;
}

bool EnableAsyncCollective(Operation* op) {
  if (llvm::isa<mhlo::AllReduceOp>(op)) {
    if (const char* env_p = std::getenv("ENABLE_ASYNC_ALL_REDUCE")) {
Does this make all collective ops async? Do we need to separate sync and async collective ops?
Fixed.
class Edge;

struct GraphNode {
There is a latency-hiding scheduler in the TensorFlow source tree; can we reuse its base class implementation?
ref: https://github.com/pai-disc/tensorflow/blob/features/bladedisc_rebase_20230202/tensorflow/compiler/xla/service/latency_hiding_scheduler.h
LGTM with a tiny comment; it does not block the merge.
#endif

struct BaseCudaContextOption {
  ncclComm_t nccl_comm = nullptr;
  GpuStreamHandle stream = nullptr;
  GpuStreamHandle comm_stream = nullptr;
Should this be comm_stream or compute_stream?
The original stream is the compute stream; we add this new comm_stream for communication.
}
}

void InitilizeGrpahTopology(std::vector<Operation*>& post_order_instructions,
typo: Graph
OK, will fix it in the next PR.
for (auto& block : main_func.getBody()) {
  for (int op_idx = 0; op_idx < scheduled_op_sequence.size(); op_idx++) {
    scheduled_op_sequence[op_idx]->moveBefore(&block.front());
Does scheduled_op_sequence contain all ops before scheduling? Does the loop at L983 mean the scheduled op sequence replaces the old op sequence?
Yes, it contains all ops.