
expand removing consecutive cast to handle meta operations in between #3644

Open · jjsjann123 wants to merge 45 commits into main

Conversation

@jjsjann123 (Collaborator) commented Dec 24, 2024:

The existing ConsecutiveCast optimization pass only optimizes chains of consecutive cast operations. This PR expands the ConsecutiveCast pass to handle cases where a chain of cast operations is broken by a meta operation in the middle.

e.g.

T1 = castOp(T0, fp32)
T2 = squeeze(T1)
T3 = castOp(T2, fp16)

The existing pass wouldn't be able to cancel out the two casts, because they are separated by the squeeze operation.

In this PR, before we trace back from the last CastOp for the chain of casts, we look at the input to the cast operation. If it is a movable meta operation, we swap the order of the meta op and the cast op first, then we resume the chain lookup on consecutive casts.
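Roughly, the swap step looks like the sketch below. This is a minimal sketch rather than the code in the diff: it assumes the helpers isMovableMeta and replayMetaOnNewInput introduced by this PR (with replayMetaOnNewInput taking the meta Expr and the new input Val), uses nvFuser's castOp builder, and elides the utility that rewires consumers.

// Sketch: rewrite `meta -> cast` into `cast -> meta`, so the relocated cast
// becomes adjacent to any earlier cast and the existing chain removal applies.
Expr* maybeSwapMetaAndCast(Expr* cast) {
  Expr* meta = cast->input(0)->definition();
  if (meta == nullptr || !isMovableMeta(meta)) {
    return cast; // nothing to swap
  }
  // Cast the meta op's input directly; this is the cast moved "up".
  Val* new_cast_out =
      castOp(cast->output(0)->getDataType().value(), meta->input(0));
  // Replay the meta op on the freshly cast value.
  Val* new_meta_out = replayMetaOnNewInput(meta, new_cast_out);
  // The pass then rewires consumers of cast->output(0) (and fusion outputs)
  // to new_meta_out; that rewiring utility is elided in this sketch.
  (void)new_meta_out;
  // Continue the consecutive-cast lookup from the relocated cast.
  return new_cast_out->definition();
}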

@jjsjann123 (Collaborator, Author): !test

@jjsjann123 (Collaborator, Author): !test

@jjsjann123 (Collaborator, Author): !test


// replays meta operation on `new_in`. return the new output from replayed meta
// operation
Val* replayMetaOnNewInput(
@jjsjann123 (Collaborator, Author):

This is an added function to replay the meta operation on the new input.

Squeeze/Broadcast/Set are all simple, but replaying a ViewOp requires replaying the transform, which justifies having a separate function.
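For reference, the dispatch might look roughly like the sketch below. The op-class accessors and ops helpers here are assumed from nvFuser's public headers rather than taken from the diff, and the ViewOp branch is left out because it needs the transform replay mentioned above.

Val* replayMetaOnNewInput(Expr* meta, Val* new_in) {
  auto* new_in_tv = new_in->as<TensorView>();
  if (auto* bcast = dynamic_cast<BroadcastOp*>(meta)) {
    return broadcast(new_in_tv, bcast->getBroadcastDimFlags());
  }
  if (auto* sq = dynamic_cast<SqueezeOp*>(meta)) {
    return squeeze(new_in_tv, sq->getSqueezeDimFlags());
  }
  if (meta->isA<LoadStoreOp>()) {
    return set(new_in_tv); // plain Set
  }
  // ViewOp: replay the logical-domain transforms on new_in_tv (omitted here;
  // this is the case that justifies a dedicated function).
  return nullptr;
}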

A reviewer (Collaborator) replied:

Is it possible to reuse/extend Expr* replayExprWithNewInput(Expr* e, Val* new_in)?

@@ -92,7 +189,92 @@ Val* replaceInputInCast(Val* cast_output, Val* new_input) {
//
// b. otherwise, we can't bypass `lo_anchor` cast, we rewire this
// section as `starting_anchor`->`lo_anchor`->`expr->output(0)`
Expr* moveChainedCasts(Expr* expr, std::unordered_set<Expr*>& visited) {
@jjsjann123 (Collaborator, Author):

There's no code change here. Just mechanically moving this into a separate function.

}

// optimize chained cast operations ending at expr
expr = moveChainedCasts(expr, visited);
@jjsjann123 (Collaborator, Author):

The removed code has been moved into the moveChainedCasts function.

@jjsjann123 marked this pull request as ready for review January 3, 2025 00:41
@jjsjann123 (Collaborator, Author): !test

// T2 = castOp(T1, fp16)
// T3 = squeeze(T2)
// and we can further cancel out the two cast ops.
if (isMovableMeta(expr->input(0)->definition())) {
@jjsjann123 (Collaborator, Author):

This is the added logic where we swap the cast op with the meta op.

@jjsjann123 changed the title from "Preseg passes consecutive cast" to "expand removing consecutive cast to handle meta operations in between" on Jan 3, 2025
@naoyam (Collaborator) commented Jan 3, 2025:

Haven't actually looked at the code yet, but some general question first.

> If it's a movable meta operation, we swap the order of the meta op and the cast op first,

Does that mean the reordering is done no matter if that would lead to consecutive cast ops? If so, if given a fusion like:

t0: fusion input tv of type bf16
t1_bf16 = reshape(t0); 
t2_bf16 = reshape(t1_bf16);
t3_bf16 = reshape(t2_bf16);
t4_bf16 = reshape(t3_bf16);
t5_bf16 = reshape(t4_bf16);
t6_fp32 = bf16ToFp32(t5); // fusion output

Am I right that this PR would change this fusion to:

t0: fusion input tv of type bf16
t0_fp32 = bf16ToFp32(t0);
t1_fp32 = reshape(t0_fp32); 
t2_fp32 = reshape(t1_fp32);
t3_fp32 = reshape(t2_fp32);
t4_fp32 = reshape(t3_fp32);
t5_fp32 = reshape(t4_fp32);
t6_fp32 = t5_fp32; // fusion output

I guess this change wouldn't impact anything much, but I'm not sure why we should do this. Since reshape is not a true meta operation in nvFuser, using a higher precision unnecessarily doesn't seem like an optimization.

@jjsjann123 (Collaborator, Author) commented Jan 3, 2025:

> Does that mean the reordering is done no matter if that would lead to consecutive cast ops? [...] Since reshape is not a true meta operation in nvFuser, using a higher precision unnecessarily doesn't seem like an optimization.

That's a great point.

Given that the pattern the consecutive cast pass targets is upCast -> downCast, I think I can update the logic to only propagate the downCast toward the input. In that case we are reducing the intermediate buffer size, which seems more strictly like an optimization.
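Concretely (in the same notation as the example in the description), hoisting only the downcast shrinks the buffer the meta op produces:

before:
T1 = squeeze(T0)         (T0, T1 in fp32)
T2 = castOp(T1, fp16)

after hoisting the downcast:
T1 = castOp(T0, fp16)
T2 = squeeze(T1)         (the squeeze output buffer is now fp16, half the bytes)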

@jjsjann123 (Collaborator, Author): !test

@jjsjann123 (Collaborator, Author):
The benchmark failure is coming from segmentation: there's a set->cast pattern in NvFuserScheduler_TIMM_vit_base_patch16_224_bcast5_NCHW___GRAPH/NvFuserScheduler_TIMM_vit_base_patch16_224_bcast5_NCHW/64/197/768/manual_time, which now produces some no-op segments after the reorder.

I patched that in #3670

@wujingyue (Collaborator):
> propagate downCast to input

Good idea. In addition, you could propagate upcasts to outputs. Hopefully, after propagating up and down, the cancellable casts will be adjacent and trivial to remove.

(It's certainly fine to leave this for the future.)
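For illustration (same notation as the examples above), pushing an upcast toward the outputs attacks the same pattern from the other side:

T1 = castOp(T0, fp32)    (T0 is fp16)
T2 = squeeze(T1)
T3 = castOp(T2, fp16)

after pushing the upcast past the squeeze:

T1 = squeeze(T0)         (squeeze runs in fp16)
T2 = castOp(T1, fp32)
T3 = castOp(T2, fp16)    (now adjacent to the upcast; the pair cancels)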

return ldst->opType() == LoadStoreOpType::Set && in_tv != nullptr &&
out_tv != nullptr
// The hasRoot() check is to prevent picking up Set.Permute ops here
&& !ldst->out()->as<TensorView>()->hasRoot();
A reviewer (Collaborator):
Suggested change
&& !ldst->out()->as<TensorView>()->hasRoot();
&& !out_tv->hasRoot();

}
auto in_tv = dynamic_cast<TensorView*>(ldst->in());
auto out_tv = dynamic_cast<TensorView*>(ldst->out());
return ldst->opType() == LoadStoreOpType::Set && in_tv != nullptr &&
A reviewer (Collaborator):
Nit: I'd split this into a series of early-exit checks for clarity and easy debugging. For example,

if (ldst->opType() != Set) {
  return false;
}

auto in_tv = ...;
if (in_tv == nullptr) {
  return false;
}

auto out_tv = ...;
if (out_tv == nullptr) {
  return false;
}

...
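Spelled out with the conditions from the diff hunk above (the function name here is just illustrative, not taken from the PR):

bool isMovableSet(LoadStoreOp* ldst) {
  if (ldst->opType() != LoadStoreOpType::Set) {
    return false;
  }
  auto* in_tv = dynamic_cast<TensorView*>(ldst->in());
  if (in_tv == nullptr) {
    return false;
  }
  auto* out_tv = dynamic_cast<TensorView*>(ldst->out());
  if (out_tv == nullptr) {
    return false;
  }
  // The hasRoot() check prevents picking up Set.Permute ops here.
  return !out_tv->hasRoot();
}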

@@ -20,6 +25,113 @@ bool isCast(Expr* expr) {
return false;
}

// for pattern `expr -> cast`, this function returns whether to replace it with
// `cast -> expr`
bool swapMetaCast(Expr* cast) {
A reviewer (Collaborator):

Suggested change
bool swapMetaCast(Expr* cast) {
bool shouldSwapMetaCast(Expr* cast) {

since it doesn't perform the swap.

return false;
}

Expr* expr = cast->input(0)->definition();
A reviewer (Collaborator):

Suggested change
Expr* expr = cast->input(0)->definition();
Expr* meta = cast->input(0)->definition();

Nit: the name expr is too generic and lacks intention.



}
do {
// when cast op expr is following a meta operation that's safe to be
// swapped, we do so hoping it would place the cast op to another cast op
A reviewer (Collaborator):

I think "hoping" is inaccurate. Even if we don't find a cancellable upcast, it's still better/neutral to move a downcast up for a potentially smaller intermediate buffer size.

continue;
}
// We do not support the replay if expr out has non-trivial transforms
// between its logical_dom to alloc_dom.
A reviewer (Collaborator):

Non-permuting allocation domains will be the norm for multi-GPU fusions with DID loop split. Anything you can do to save my future time will be greatly appreciated!

// T2 = castOp(T1, fp16)
// T3 = squeeze(T2) // operation in reduced precision
// and we can further cancel out the two cast ops.
if (swapMetaCast(expr)) {
A reviewer (Collaborator):

The current logic

do {
  if (shouldSwapMetaCast(expr)) {
    ...
    expr = swapMetaCast(...)
    ...
  }
  expr = removeDoubleCasts(...)
} while (canSwapMetaCast(expr));

is a bit convoluted.

I think it can be simplified by separating moving upcasts and removing roundtrip casts. For example,

for each expr in backward order {
  while (shouldSwapMetaCast(expr)) {
    ...
    expr = swapMetaCast(...);
    ...
  }
}
...
for each expr {
  if (expr is a roundtrip cast) {
    redirect expr's consumers to expr's input's input. 
  }
}
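Filling that structure in a bit (still a sketch: it assumes the shouldSwapMetaCast/swapMetaCast helpers named above, the existing isCast helper from this pass, and a hypothetical isWideningCast predicate meaning "the cast's output dtype represents every value of its input dtype exactly"):

// Phase 1: hoist downcasts above movable meta ops.
for (Expr* expr : fusion->exprs()) { // traversed backward in practice
  while (shouldSwapMetaCast(expr)) {
    expr = swapMetaCast(expr);
  }
}

// Phase 2: remove roundtrip casts, i.e. a widening cast immediately followed
// by a cast back to the original dtype (e.g. fp16 -> fp32 -> fp16).
for (Expr* expr : fusion->exprs()) {
  if (!isCast(expr)) {
    continue;
  }
  Expr* producer = expr->input(0)->definition();
  if (producer == nullptr || !isCast(producer) || !isWideningCast(producer)) {
    continue;
  }
  Val* original = producer->input(0);
  if (original->getDataType() == expr->output(0)->getDataType()) {
    // Redirect consumers (and fusion outputs) of expr->output(0) to original.
  }
}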


// adding prev_expr to visited node so we'll short-cut it.
visited.insert(prev_expr);
A reviewer (Collaborator):

I'm unsure we need or will still need visited.
