Triton Performance Worse on Split Activation in Forward Pass #1186
xanderdunn asked this question in Q&A (unanswered)
Running this benchmark file as-is produces this output:
![partial-gelu-performance-prev](https://user-images.githubusercontent.com/1313618/218340712-da01b663-8a99-4249-9254-edaa0a5586ed.png)
This is the forward pass in PyTorch:
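A minimal sketch of what such a split ("partial GELU") forward pass looks like in PyTorch, assuming the projection output is split in half, GELU is applied to the first half only, and the two halves are concatenated back together (the function and tensor names here are illustrative, not the exact benchmark code):

```python
import torch
import torch.nn.functional as F


def partial_gelu_forward(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Project x with w (shape K x 2N) and apply GELU to the left half only."""
    z = x @ w                    # (M, 2N)
    z1, z2 = z.chunk(2, dim=-1)  # each (M, N)
    # GELU on the first half; the second half passes through untouched.
    return torch.cat([F.gelu(z1), z2], dim=-1)


# Illustrative shapes:
x = torch.randn(512, 1024, device="cuda", dtype=torch.float16)
w = torch.randn(1024, 2048, device="cuda", dtype=torch.float16)
out = partial_gelu_forward(x, w)  # (512, 2048)
```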
The equivalent Triton kernel implementation is the `gelu_partial_layer_fused_forward` function.
I'm surprised that the Triton performance is so much worse. Do you see any issues with the kernel implementation? It is a small modification of the provided matmul tutorial. I wonder if it's related to the experience in #984, where @jmc128 found that having two accumulators hurt Triton kernel performance. That is essentially what I have here: `accumulator_left` is `z1` and `accumulator_right` is `z2`.

I'm running on the latest master commit, 3fa8a5a864c48a490625648387a86be3eb7c2c06, built from source, on a GCP machine with a single A100, Ubuntu 22.04, Python 3.8.
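For reference, the overall structure of such a fused kernel with two accumulators and a partial-GELU epilogue looks roughly like the sketch below. This is a simplified sketch, not the exact benchmark code: inner-loop loads are unmasked, dimensions are assumed divisible by the block sizes, the output is assumed to be float16, and GELU uses the sigmoid approximation. The kernel and wrapper names here are illustrative.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def partial_gelu_matmul_kernel(
    a_ptr, b_ptr, out_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_om, stride_on,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program computes one BLOCK_M x BLOCK_N tile of both output halves.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # A tile, plus the matching tiles of the left (columns [0, N)) and
    # right (columns [N, 2N)) halves of B.
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_left_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    b_right_ptrs = b_left_ptrs + N * stride_bn

    # Two accumulators: z1 (gets GELU) and z2 (passes through).
    acc_left = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    acc_right = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        acc_left += tl.dot(a, tl.load(b_left_ptrs))
        acc_right += tl.dot(a, tl.load(b_right_ptrs))
        a_ptrs += BLOCK_K * stride_ak
        b_left_ptrs += BLOCK_K * stride_bk
        b_right_ptrs += BLOCK_K * stride_bk

    # Epilogue: GELU (sigmoid approximation) on the left half only.
    left = (acc_left * tl.sigmoid(1.702 * acc_left)).to(tl.float16)
    right = acc_right.to(tl.float16)

    out_left_ptrs = out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
    out_right_ptrs = out_left_ptrs + N * stride_on
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(out_left_ptrs, left, mask=mask)
    tl.store(out_right_ptrs, right, mask=mask)


def partial_gelu_matmul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # x: (M, K), w: (K, 2N); output: (M, 2N) with GELU applied to the left half.
    M, K = x.shape
    N = w.shape[1] // 2
    out = torch.empty((M, 2 * N), device=x.device, dtype=torch.float16)
    BLOCK_M, BLOCK_N, BLOCK_K = 64, 64, 32
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    partial_gelu_matmul_kernel[grid](
        x, w, out, M, N, K,
        x.stride(0), x.stride(1),
        w.stride(0), w.stride(1),
        out.stride(0), out.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return out
```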