Replies: 1 comment
The typical input sequence length is 1, so the quantized kernels are used whenever you're generating tokens one by one and the batch size is smaller than 50. The reason for the threshold (which may need adjusting, I'm not entirely sure 50 is still ideal given all the small tweaks I've made over time) is that at some point the operation becomes compute bound (GEMM) rather than memory bound (GEMV). Even if I could write GEMM kernels as efficient as cuBLAS, the extra overhead of dequantizing on the fly into wmma::fragments, including the register/SMEM pressure that hurts occupancy, would ultimately make those kernels less efficient than ones that work directly on an FP16 matrix in global memory. Meanwhile, the cost of reconstructing the FP16 weights is constant, so there's always going to be a point at which it's more efficient to reconstruct and defer to cuBLAS (or Torch or whatever).
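To make the dispatch concrete, here is a rough host-side sketch of that logic, assuming a fixed row threshold. `QuantizedWeight`, `launch_quantized_gemm_kernel`, `launch_reconstruct_kernel` and `MAX_Q_GEMM_ROWS` are hypothetical stand-ins rather than the actual exllamav2 API; only the cuBLAS call is a real library function.

```cuda
#include <cstdint>
#include <cuda_fp16.h>
#include <cublas_v2.h>

// Hypothetical container and launchers standing in for the real quantized kernels.
struct QuantizedWeight { const uint32_t* packed; const half* scales; };
void launch_quantized_gemm_kernel(const half* x, const QuantizedWeight& w, half* out, int rows, int k, int n);
void launch_reconstruct_kernel(const QuantizedWeight& w, half* w_fp16, int k, int n);

// Row count at which the op flips from memory bound (GEMV) to compute bound (GEMM).
const int MAX_Q_GEMM_ROWS = 50;

void q_gemm_dispatch(cublasHandle_t cublas,
                     const half* x, int rows, int k,   // activations, row-major [rows, k]
                     const QuantizedWeight& w, int n,  // packed int4 weights of a [k, n] matrix
                     half* out,                        // output, row-major [rows, n]
                     half* temp_w)                     // scratch for reconstructed FP16 weights, [k, n]
{
    if (rows <= MAX_Q_GEMM_ROWS)
    {
        // Memory-bound regime: dequantize on the fly inside the kernel and
        // never materialize the FP16 weight matrix.
        launch_quantized_gemm_kernel(x, w, out, rows, k, n);
    }
    else
    {
        // Compute-bound regime: pay the (constant) cost of reconstructing the
        // FP16 weights once, then hand the matmul to cuBLAS.
        launch_reconstruct_kernel(w, temp_w, k, n);

        const half alpha = __float2half(1.0f);
        const half beta  = __float2half(0.0f);

        // cuBLAS is column-major, so compute out^T = w^T * x^T by swapping operands.
        cublasHgemm(cublas, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, rows, k,
                    &alpha, temp_w, n, x, k,
                    &beta, out, n);
    }
}
```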
https://github.com/turboderp/exllamav2/blob/master/exllamav2/exllamav2_ext/cuda/q_gemm.cu#L121
Hi, I am currently integrating the code into my project. One thing I am not certain about is the calculation scheme of w4a16. From the code line above, it seems that when the input seq_len is above 50, the int4 to half conversion and the FP16 matmul are processed separately. So for most input seq_lens, the fused w4a16 calculation is not used. Am I right?
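In case it helps to picture the difference: "fused" here means the int4 to half conversion happens inside the matmul kernel itself instead of as a separate pass over the weights. A deliberately simplified sketch of such a kernel, assuming one scale per column and a fixed zero-point of 8 (the real kernels in q_gemm.cu use per-group parameters and are far more optimized):

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Toy fused w4a16 GEMV: y[1, n] = x[1, k] * W[k, n], with W stored as packed
// int4. Each uint32_t packs eight 4-bit weights of one column; each thread
// computes one output element. Assumes k is a multiple of 8.
__global__ void w4a16_gemv_sketch(
    const half*     __restrict__ x,      // activations, [k]
    const uint32_t* __restrict__ w_q,    // packed weights, [k / 8, n]
    const half*     __restrict__ scale,  // per-column scale, [n] (hypothetical; real kernels use per-group scales)
    half*           __restrict__ y,      // output, [n]
    int k, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;

    float acc = 0.0f;
    float s = __half2float(scale[col]);

    for (int i = 0; i < k / 8; ++i)
    {
        uint32_t packed = w_q[i * n + col];          // eight 4-bit weights for this column

        #pragma unroll
        for (int j = 0; j < 8; ++j)
        {
            int q = (packed >> (4 * j)) & 0x0F;      // unpack one 4-bit value
            float w = (q - 8) * s;                   // dequantize on the fly, never writing FP16 weights to memory
            acc += __half2float(x[i * 8 + j]) * w;   // multiply-accumulate against the FP16 activation
        }
    }
    y[col] = __float2half(acc);
}
```

Above the threshold, by contrast, a separate kernel writes the full FP16 weight matrix out once and cuBLAS does the matmul, which matches the split described in the question.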