[WSLowering] consumer release on each thread instead of master thread #10
base: ws
remoteCTAId, false, 0);

auto arriveOp = builder.create<ttng::MBarrierArriveOp>(
    loc, bufferEmpty, nullptr, nullptr, false, 0);
assert(op.getOperation()->hasAttr("async_task_id"));
I actually don't quite get this logic around remote-cta. It seems this change gets rid of the remote-cta mode.
Yes, this gets rid of remote-cta. Do you think it is still useful? I'm not sure how it would be used.
unsigned bufferEmptyCount = numCTAs;
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView, numCTAs);
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView,
                                    THREADS_PER_TASK);
So this changes from a barrier across CTAs to a barrier within the warp group of 128 threads?
This changes from expecting a single master thread from the WG to perform the barrier arrival to having all threads within the WG perform it.
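To make the arrival-count change concrete, here is a minimal host-side sketch. It is not the Triton lowering or the PTX mbarrier API: `ArrivalBarrier`, `kThreadsPerTask`, and both release helpers are hypothetical names, the warp-group size of 128 is assumed from the discussion above, and the old scheme is modeled for the single-CTA case where the expected count is 1.

```cpp
#include <cassert>

// Assumed warp-group size, mirroring THREADS_PER_TASK in the diff.
constexpr unsigned kThreadsPerTask = 128;

// Hypothetical model of an arrival-count barrier: the phase flips once the
// number of arrivals reaches the count the barrier was initialized with.
class ArrivalBarrier {
public:
  explicit ArrivalBarrier(unsigned expected)
      : expected_(expected), pending_(expected) {
    assert(expected > 0);
  }

  // Record one arrival; the last expected arrival completes the phase.
  void arrive() {
    assert(pending_ > 0);
    if (--pending_ == 0) {
      phase_ ^= 1u;
      pending_ = expected_;
    }
  }

  unsigned phase() const { return phase_; }
  unsigned pending() const { return pending_; }

private:
  unsigned expected_;
  unsigned pending_;
  unsigned phase_ = 0;
};

// Old scheme (modeled): only the master thread of the warp group arrives.
inline unsigned releaseFromMasterThread(ArrivalBarrier &bar) {
  unsigned arrivals = 0;
  for (unsigned tid = 0; tid < kThreadsPerTask; ++tid) {
    if (tid == 0) {
      bar.arrive();
      ++arrivals;
    }
  }
  return arrivals;
}

// New scheme (modeled): every thread of the warp group arrives.
inline unsigned releaseFromAllThreads(ArrivalBarrier &bar) {
  unsigned arrivals = 0;
  for (unsigned tid = 0; tid < kThreadsPerTask; ++tid) {
    bar.arrive();
    ++arrivals;
  }
  return arrivals;
}
```

In this model the init count and the number of arriving threads have to change together: a barrier initialized with `kThreadsPerTask` completes its phase only when all 128 threads arrive, and would stay pending if only the master thread arrived.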
unsigned bufferEmptyCount = numCTAs;
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView, numCTAs);
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView,
                                    THREADS_PER_TASK);
I can't comment on unchanged code, so for this later section:
// Insert a cluster barrier before the kernel exits. Without this barrier,
// mbarrier_remote_arrive will fail if the remote CTA already exits.
Is it still valid with this change?
Yeah, it would fail I guess. I'm getting rid of remote CTA mode as I'm not sure how it would be used. Let me know if I'm missing anything.
Nice perf win!
Having each thread run the consumer release operation simplifies the logic by avoiding the master thread ID computation. This seems to improve performance a bit.
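As a rough illustration of what the predicate removal means, the sketch below counts arrivals per warp group with and without a master-thread check. The helper names and the `threadId % 128 == 0` predicate are assumptions made for illustration; the actual master thread ID computation in the lowering is not shown in this conversation.

```cpp
// Assumed warp-group size, mirroring THREADS_PER_TASK in the diff.
constexpr unsigned kThreadsPerTask = 128;

// Hypothetical mapping from a linear thread id to its warp group.
inline unsigned warpGroupOf(unsigned threadId) {
  return threadId / kThreadsPerTask;
}

// Hypothetical master-thread predicate that a master-only release would need.
inline bool isMasterThread(unsigned threadId) {
  return threadId % kThreadsPerTask == 0;
}

// Count the arrivals one warp group contributes to its barrier.
// masterOnly = true models the predicated release, false the unpredicated one.
inline unsigned countArrivals(unsigned numThreads, unsigned group,
                              bool masterOnly) {
  unsigned arrivals = 0;
  for (unsigned tid = 0; tid < numThreads; ++tid) {
    if (warpGroupOf(tid) != group)
      continue;
    if (masterOnly && !isMasterThread(tid))
      continue;
    ++arrivals;
  }
  return arrivals;
}
```

With three warp groups (384 threads), the predicated form yields one arrival per group and the unpredicated form yields 128, which is why the barrier's init count moves to `THREADS_PER_TASK` in the same change.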