Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WSLowering] consumer release on each thread instead of master thread #10

Open
wants to merge 1 commit into
base: ws
Choose a base branch
from

Conversation

htyu
Copy link
Contributor

@htyu htyu commented Dec 16, 2024

Having each thread run the consumer release operation simplifies the logics by avoiding master thread id computation. This seems to help improve performance a bit.

  • 2% speedup for triton_tutorial_flash_v2_tma_ws
  • 0.3% speedup for matmul_layernorm_ws

@htyu htyu requested review from bertmaher and manman-ren December 16, 2024 17:20
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 16, 2024
remoteCTAId, false, 0);

auto arriveOp = builder.create<ttng::MBarrierArriveOp>(
loc, bufferEmpty, nullptr, nullptr, false, 0);
assert(op.getOperation()->hasAttr("async_task_id"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't quite get this logic around remote-cta. It seems this change gets rid of the remote-cta mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this gets rid of remote-cta. Do you think it is still useful? I'm not sure how it would be used.

unsigned bufferEmptyCount = numCTAs;
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView, numCTAs);
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView,
THREADS_PER_TASK);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this changes from a barrier across CTAs to a barrier within the warp group of 128 threads?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes from expecting a master thread form a WG running the barrier arrival to all threads within the WG running it.

unsigned bufferEmptyCount = numCTAs;
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView, numCTAs);
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView,
THREADS_PER_TASK);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't comment on unchanged code. So for this later section
// Insert a cluster barrier before the kernel exits. Without this barrier,
// mbarrier_remote_arrive will fail if the remote CTA already exits.
Is it still valid with this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it would fail I guess. I'm getting rid of remote CTA mode as I'm not sure how it would be used. Let me know if I'm missing anything.

@manman-ren
Copy link
Contributor

Nice perf win!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants