Triton Performance Worse on Split Activation in Forward Pass #1186
xanderdunn asked this question in Q&A (unanswered)
Running this benchmark file as-is produces this output:
![partial-gelu-performance-prev](https://user-images.githubusercontent.com/1313618/218340712-da01b663-8a99-4249-9254-edaa0a5586ed.png)
This is the forward pass in PyTorch:
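A minimal sketch of what such a split ("partial GELU") forward pass looks like in PyTorch, assuming the projection output is split in half, GELU is applied to the first half only, and the two halves are concatenated back together (the function and tensor names here are illustrative, not the exact benchmark code):

```python
import torch
import torch.nn.functional as F


def partial_gelu_forward(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Project x with w (shape K x 2N) and apply GELU to the left half only."""
    z = x @ w                    # (M, 2N)
    z1, z2 = z.chunk(2, dim=-1)  # each (M, N)
    # GELU on the first half; the second half passes through untouched.
    return torch.cat([F.gelu(z1), z2], dim=-1)


# Illustrative shapes:
x = torch.randn(512, 1024, device="cuda", dtype=torch.float16)
w = torch.randn(1024, 2048, device="cuda", dtype=torch.float16)
out = partial_gelu_forward(x, w)  # (512, 2048)
```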
The equivalent Triton kernel implementation is the `gelu_partial_layer_fused_forward` function.
I'm surprised that the Triton performance is so much worse. Do you see any issues with the kernel implementation? It is a small modification of the provided matmul tutorial. I wonder if it's related to the experience in #984, where @jmc128 found that having two accumulators hurt Triton kernel performance. That is essentially what I have here: `accumulator_left` is `z1` and `accumulator_right` is `z2`.

I'm running on the latest master commit, 3fa8a5a864c48a490625648387a86be3eb7c2c06, built from source, on a GCP machine with a single A100, Ubuntu 22.04, Python 3.8.
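For reference, the overall structure of such a fused kernel with two accumulators and a partial-GELU epilogue looks roughly like the sketch below. This is a simplified sketch, not the exact benchmark code: inner-loop loads are unmasked, dimensions are assumed divisible by the block sizes, the output is assumed to be float16, and GELU uses the sigmoid approximation. The kernel and wrapper names here are illustrative.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def partial_gelu_matmul_kernel(
    a_ptr, b_ptr, out_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_om, stride_on,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program computes one BLOCK_M x BLOCK_N tile of both output halves.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # A tile, plus the matching tiles of the left (columns [0, N)) and
    # right (columns [N, 2N)) halves of B.
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_left_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    b_right_ptrs = b_left_ptrs + N * stride_bn

    # Two accumulators: z1 (gets GELU) and z2 (passes through).
    acc_left = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    acc_right = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        acc_left += tl.dot(a, tl.load(b_left_ptrs))
        acc_right += tl.dot(a, tl.load(b_right_ptrs))
        a_ptrs += BLOCK_K * stride_ak
        b_left_ptrs += BLOCK_K * stride_bk
        b_right_ptrs += BLOCK_K * stride_bk

    # Epilogue: GELU (sigmoid approximation) on the left half only.
    left = (acc_left * tl.sigmoid(1.702 * acc_left)).to(tl.float16)
    right = acc_right.to(tl.float16)

    out_left_ptrs = out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
    out_right_ptrs = out_left_ptrs + N * stride_on
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(out_left_ptrs, left, mask=mask)
    tl.store(out_right_ptrs, right, mask=mask)


def partial_gelu_matmul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # x: (M, K), w: (K, 2N); output: (M, 2N) with GELU applied to the left half.
    M, K = x.shape
    N = w.shape[1] // 2
    out = torch.empty((M, 2 * N), device=x.device, dtype=torch.float16)
    BLOCK_M, BLOCK_N, BLOCK_K = 64, 64, 32
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    partial_gelu_matmul_kernel[grid](
        x, w, out, M, N, K,
        x.stride(0), x.stride(1),
        w.stride(0), w.stride(1),
        out.stride(0), out.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return out
```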