
Fix rfactor replay for DID loop split #3543

Open · wants to merge 6 commits into main
Conversation

wujingyue (Collaborator) commented on Dec 7, 2024

For #2563

naoyam (Collaborator) commented on Dec 10, 2024

What does the actual fusion that is passed to the reduction scheduler look like?

@wujingyue wujingyue force-pushed the wjy/rs branch 2 times, most recently from 6d03163 to 66a3363 Compare December 10, 2024 15:56
Base automatically changed from wjy/rs to main December 11, 2024 02:26
github-actions bot commented on Feb 10, 2025

Review updated until commit f45bf68

Description

  • Update rfactor replay logic for DID loop split

  • Enhance test coverage with dynamic sizes

  • Correct tutorial comments and remove redundant checks

  • Improve allreduce test with dynamic tensor sizes


Changes walkthrough 📝

Relevant files:

Bug fix
  csrc/scheduler/reduction.cpp — Update 3D reduction error check
    • Update error check for 3D reduction schedules with DIDx sharding
    +2/-1
  csrc/tensor_view.cpp — Remove redundant rfactor check
    • Remove redundant check for rfactor on the same view
    +0/-6

Enhancement
  csrc/transform_rfactor.cpp — Simplify IterType assignment
    • Simplify IterType assignment in ReplayRFactor
    +2/-2
  tests/python/test_communication.py — Enhance allreduce test
    • Enhance allreduce test with dynamic tensor sizes and rfactor
    • Add device mesh and parallelization for rfactor output
    +18/-4

Documentation
  tests/cpp/test_tutorial.cpp — Update tutorial comments
    • Update comments and print method in reduction rfactor test
    +4/-4

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The new error check may be stricter than intended. The original check, isSharded(reduction_tv), tests whether the tensor view is sharded at all. The new check, getShardedLoopAxis(reduction_tv, ParallelType::DIDx) >= 0, looks for a loop axis parallelized specifically on DIDx, so it may not cover every case in which the tensor view is sharded.

    !(rparams->schedule_3D &&
      getShardedLoopAxis(reduction_tv, ParallelType::DIDx) >= 0),
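To make the concern concrete, here is a toy Python model of the two checks. These classes and functions are hypothetical stand-ins, not the nvFuser API; they only mimic the distinction between "sharded on any device dimension" and "has a loop axis on one specific device dimension":

```python
from dataclasses import dataclass

@dataclass
class IterDomain:
    # Hypothetical stand-in for nvFuser's IterDomain; only the
    # parallel type matters for this illustration.
    parallel_type: str  # e.g. "DIDx", "DIDy", "Serial"

def is_sharded(loop_domain):
    # Analogue of the original check: any device-parallel axis counts.
    return any(d.parallel_type.startswith("DID") for d in loop_domain)

def sharded_loop_axis(loop_domain, ptype):
    # Analogue of the new check: position of an axis with one specific
    # parallel type, or -1 if there is none.
    for i, d in enumerate(loop_domain):
        if d.parallel_type == ptype:
            return i
    return -1

# A tensor sharded on DIDy is sharded, yet has no DIDx loop axis,
# so the two checks disagree.
loop = [IterDomain("DIDy"), IterDomain("Serial")]
print(is_sharded(loop))                 # True
print(sharded_loop_axis(loop, "DIDx"))  # -1
```

Whether such a DIDy-only case can actually reach this code path in the scheduler is exactly what a reviewer would want to confirm.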
Removed Check

The check for domain()->hasRoot() was removed. It ensured that rfactor is not called twice on the same view, which may matter for the correctness of tensor view transformations.

        definition() != nullptr &&
            (definition()
                 ->isStrictlyOneOf<ReductionOp, MmaOp, MatmulOp, LinearOp>()),
        "Error rfactoring ",
        this,
        " its definition is either a nullptr or not a reduction.");
    NVF_CHECK(
        !definition()->isA<GroupedReductionOp>(),
        "For GroupedReductionOp, use TensorView::rFactor(const std::vector<int64_t>& axes, const std::vector<TensorView*>& tvs)");
Hardcoded Values

The test test_allreduce uses hardcoded values for the tensor dimensions (m = d * 2, n = 3). This might limit the test's ability to catch issues with different input sizes. Consider using more dynamic or parameterized test cases.

    def test_allreduce(multidevice_test):
        d = multidevice_test.size
        mesh = nvfuser.DeviceMesh(range(d))

        class Model(FusionDefinition):
            def definition(self):
                self.inp = self.define_tensor(
                    (-1, -1), contiguity=True, dtype=DataType.Float
                )
                self.out = self.ops.sum(self.inp, [0])
                self.add_output(self.out)

            def multidevice_schedule(self):
                self.sched.split(self.inp, 0, d, False)
                self.sched.split(self.out, 0, d, False)
                out_local = self.sched.rfactor(self.out, [1])

                self.sched._set_device_mesh(self.inp, mesh)
                self.sched._set_device_mesh(self.out, mesh)
                self.sched._set_device_mesh(out_local, mesh)

                self.sched.parallelize(self.inp, 0, nvfuser.ParallelType.mesh_x)
                self.sched.parallelize(out_local, 0, nvfuser.ParallelType.mesh_x)

                self.sched.set_allocation_as_loop(self.inp)
                self.sched.set_allocation_as_loop(out_local)
                self.sched.set_allocation_as_loop(self.out)

        m = d * 2
        n = 3
        unsharded = torch.randn(m, n)
        sharded = multidevice_test.shard_tensor(unsharded, 0, mesh)

        fd = Model()
        outputs = fd.execute([sharded])
        torch.testing.assert_close(outputs[0].local.cpu(), unsharded.sum(0))
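One way to address this point, sketched in plain Python with hypothetical names (the real test would feed each derived shape into the fusion above rather than just validating it):

```python
# A hedged sketch of parameterizing the sizes rather than hardcoding
# m = d * 2, n = 3. The only hard constraint is that the sharded axis
# divides evenly across the d devices.
def make_shape(d, rows_per_device, n):
    m = d * rows_per_device
    return m, n

# Several (rows_per_device, n) cases, including a single-row shard
# and a single-column tensor.
cases = [(1, 3), (2, 5), (8, 1)]
for rows_per_device, n in cases:
    m, n = make_shape(d=4, rows_per_device=rows_per_device, n=n)
    assert m % 4 == 0  # shardable on the first axis
    print(m, n)
```

In the actual suite this would more naturally be a pytest parameterization over the tensor sizes.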

wujingyue force-pushed the wjy/rfactor branch 2 times, most recently from 678b0dd to 237980f on February 14, 2025 00:19

wujingyue (Collaborator, Author) commented:

!test

wujingyue changed the title from "Attempt to use rFactor for allreduce" to "Fix rfactor replay for DID loop split" on Feb 14, 2025
wujingyue marked this pull request as ready for review on February 14, 2025 00:33
@@ -779,12 +779,6 @@ TensorView* TensorView::rFactor(const std::vector<int64_t>& axes) {
    "Error rfactoring ",
    this,
    " its definition is either a nullptr or not a reduction.");
// For hopper matmuls, the mma_result logical domain is reordered as [M, N, K]

wujingyue (Collaborator, Author) commented:

I removed this check because we now expect rFactor to be called by both the inter- and intra-GPU schedulers.

@@ -121,12 +121,12 @@ class ReplayRFactor : public ReplayTransformations {
// rfactored domains. If it isn't involved in the rfactor, it's no
// longer a reduction domain
std::optional<IterType> outer_iter_type;
if (s->outer()->isReduction() && !rfactor_dep_ids_.count(s->outer())) {

wujingyue (Collaborator, Author) commented:

Without this, I ran into an error with the following local reduction:

    in: root/logical=[i{n}], loop=[iDIDx{d}, i{n/d}]
    out = reduction(in): root=[r{n}], logical/loop=[iDIDx{d}, r{n/d}]

The reduction scheduler tries to schedule out on TIDx:

    out: root=[r{n}], logical=[iDIDx{d}, r{n/d}], loop=[iDIDx{d}, r{n/d/blockDim.x}, rTIDx{blockDim.x}]

and then rFactors axis 1, i.e., r{n/d/blockDim.x}.

rFactor tries to replay all transforms using ReplayRFactor on a new, identical root domain [r{n}]. Without this change, the outer split by d produced rDIDx{d} instead of iDIDx{d}.
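The rule being fixed can be sketched as a toy Python model (these are not the nvFuser classes, just an illustration of the decision): when a reduction root domain is split during the rFactor replay, a split output keeps its reduction type only if it feeds one of the rfactored axes; otherwise it is no longer reduced by this tensor and must become an iteration domain.

```python
def split_output_type(input_is_reduction, output_feeds_rfactor):
    # Toy model of the IterType choice in ReplayRFactor for one
    # output of a split. "output_feeds_rfactor" models membership
    # in the rfactor dependency set (rfactor_dep_ids_).
    if input_is_reduction and not output_feeds_rfactor:
        return "Iteration"  # the fix: drop the reduction-ness
    return "Reduction" if input_is_reduction else "Iteration"

# Replaying root=[r{n}] outer-split by d, where only the inner axis
# r{n/d} is a dependency of the rfactored axes: the outer axis must
# come back as iDIDx{d}, not rDIDx{d}.
outer = split_output_type(input_is_reduction=True, output_feeds_rfactor=False)
inner = split_output_type(input_is_reduction=True, output_feeds_rfactor=True)
print(outer, inner)  # Iteration Reduction
```

This matches the example above: the DID loop split sits outside the rfactored axis, so its output axis is exempt from the reduction.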

wujingyue (Collaborator, Author) commented:

!test

wujingyue (Collaborator, Author) commented:

> What does the actual fusion that is passed to the reduction scheduler look like?

I finally debugged this through. PTAL!

Labels: None yet
Projects: None yet
2 participants