Element-wise matrix multiplication performance? #1181
-
Running the unaltered vector addition tutorial on my A100 machine produces these benchmark results:

*(benchmark plot not shown)*

Now I make a minor 2-line modification to instead do element-wise multiplication. Line 45 of the kernel changes from addition to multiplication:

*(code snippet not shown)*

Finally, I make a three-line modification to do matrix-matrix element-wise multiplication. Lines 118 and 119 of the benchmark become:

```python
x = torch.rand((size, size), device='cuda', dtype=torch.float32)
y = torch.rand((size, size), device='cuda', dtype=torch.float32)
```

I also change line 104 to keep the matrices within memory limits:

*(code snippet not shown)*

The x axes of the plots are not directly comparable: the first two show […].

Is it expected that the element-wise matrix multiplication shows orders of magnitude lower throughput? Is this a poor way of implementing matrix-matrix element-wise multiplication in Triton? PyTorch performance appears to suffer just as much, so unless I'm doing something wrong across the board, this seems to be expected behavior. I may be missing some basic context; please recommend any relevant reading.
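For context on why a large throughput gap would be surprising: an element-wise product of two `(size, size)` matrices touches exactly the same number of elements as an element-wise product of two length-`size**2` vectors, so the kernel itself should move the same number of bytes either way. A minimal sketch of that equivalence, using NumPy as a GPU-free stand-in for the Triton/PyTorch code (the variable names here are illustrative, not from the tutorial):

```python
import numpy as np

size = 1024
rng = np.random.default_rng(0)
x = rng.random((size, size), dtype=np.float32)
y = rng.random((size, size), dtype=np.float32)

# Element-wise product of the matrices...
z_matrix = x * y
# ...is identical to the element-wise product of the flattened vectors:
z_vector = x.ravel() * y.ravel()
assert np.array_equal(z_matrix.ravel(), z_vector)

# Either way, three float32 tensors of size**2 elements move through
# memory: read x, read y, write the output (4 bytes per element).
bytes_moved = 3 * 4 * size * size
```

So if the measured GB/s differs by orders of magnitude between the two shapes, the first thing to check is how the benchmark counts bytes, not the kernel.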
Replies: 1 comment 1 reply
-
@xanderdunn Did you also modify the GB/s calculation? Line 124 would need to be changed to `12 * size^2` if you're changing it to a matrix.
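To make the reply concrete: the tutorial's benchmark converts runtime to GB/s by counting bytes moved, i.e. three float32 tensors (`x`, `y`, output) at 4 bytes per element. If the tensors become `size × size` matrices but the formula still counts `12 * size` bytes, the reported throughput is understated by exactly a factor of `size`. A sketch of the fix; the exact `gbps` lambda shape is an assumption based on the tutorial version under discussion:

```python
def gbps_vector(ms: float, size: int) -> float:
    """Tutorial-style GB/s for length-`size` vectors: 12 * size bytes."""
    return 12 * size / ms * 1e-6

def gbps_matrix(ms: float, size: int) -> float:
    """Corrected GB/s for (size, size) matrices: 12 * size**2 bytes."""
    return 12 * size * size / ms * 1e-6

# With the stale vector formula, a matrix benchmark under-reports
# throughput by exactly a factor of `size`:
ratio = gbps_matrix(1.0, 4096) / gbps_vector(1.0, 4096)
assert ratio == 4096
```

This would account for the "orders of magnitude" gap in the question without any actual kernel slowdown, and it applies equally to the PyTorch baseline measured by the same harness.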