Replies: 1 comment
The typical input sequence length is 1, so the quantized kernels are used whenever you're generating tokens one by one and the batch size is smaller than 50. The reason for the threshold (which may need adjusting, I'm not entirely sure 50 is still ideal given all the small tweaks I've made over time) is that at some point the operation becomes compute bound (GEMM) rather than memory bound (GEMV). Even if I could write GEMM kernels as efficient as cuBLAS, the extra overhead of dequantizing on the fly into wmma::fragments, including the register/SMEM pressure that hurts occupancy, would ultimately make those kernels less efficient than ones that work directly on an FP16 matrix in global memory. Meanwhile, the cost of reconstructing the FP16 weights is constant, so there's always going to be a point at which it's more efficient to reconstruct and defer to cuBLAS (or Torch or whatever).
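To make the dispatch concrete, here is a rough host-side sketch of that logic, assuming a fixed row threshold. `QuantizedWeight`, `launch_quantized_gemm_kernel`, `launch_reconstruct_kernel` and `MAX_Q_GEMM_ROWS` are hypothetical stand-ins rather than the actual exllamav2 API; only the cuBLAS call is a real library function.

```cuda
#include <cstdint>
#include <cuda_fp16.h>
#include <cublas_v2.h>

// Hypothetical container and launchers standing in for the real quantized kernels.
struct QuantizedWeight { const uint32_t* packed; const half* scales; };
void launch_quantized_gemm_kernel(const half* x, const QuantizedWeight& w, half* out, int rows, int k, int n);
void launch_reconstruct_kernel(const QuantizedWeight& w, half* w_fp16, int k, int n);

// Row count at which the op flips from memory bound (GEMV) to compute bound (GEMM).
const int MAX_Q_GEMM_ROWS = 50;

void q_gemm_dispatch(cublasHandle_t cublas,
                     const half* x, int rows, int k,   // activations, row-major [rows, k]
                     const QuantizedWeight& w, int n,  // packed int4 weights of a [k, n] matrix
                     half* out,                        // output, row-major [rows, n]
                     half* temp_w)                     // scratch for reconstructed FP16 weights, [k, n]
{
    if (rows <= MAX_Q_GEMM_ROWS)
    {
        // Memory-bound regime: dequantize on the fly inside the kernel and
        // never materialize the FP16 weight matrix.
        launch_quantized_gemm_kernel(x, w, out, rows, k, n);
    }
    else
    {
        // Compute-bound regime: pay the (constant) cost of reconstructing the
        // FP16 weights once, then hand the matmul to cuBLAS.
        launch_reconstruct_kernel(w, temp_w, k, n);

        const half alpha = __float2half(1.0f);
        const half beta  = __float2half(0.0f);

        // cuBLAS is column-major, so compute out^T = w^T * x^T by swapping operands.
        cublasHgemm(cublas, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, rows, k,
                    &alpha, temp_w, n, x, k,
                    &beta, out, n);
    }
}
```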
https://github.com/turboderp/exllamav2/blob/master/exllamav2/exllamav2_ext/cuda/q_gemm.cu#L121
Hi, I am currently integrating the code into my project. One thing I am not certain about is the calculation scheme of w4a16. From the code line above, it seems that when the input seq_len is above 50, the int4 to half conversion and the FP16 matmul are processed separately. So for most input seq_lens, the fused w4a16 calculation is not used. Am I right?
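In case it helps to picture the difference: "fused" here means the int4 to half conversion happens inside the matmul kernel itself instead of as a separate pass over the weights. A deliberately simplified sketch of such a kernel, assuming one scale per column and a fixed zero-point of 8 (the real kernels in q_gemm.cu use per-group parameters and are far more optimized):

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Toy fused w4a16 GEMV: y[1, n] = x[1, k] * W[k, n], with W stored as packed
// int4. Each uint32_t packs eight 4-bit weights of one column; each thread
// computes one output element. Assumes k is a multiple of 8.
__global__ void w4a16_gemv_sketch(
    const half*     __restrict__ x,      // activations, [k]
    const uint32_t* __restrict__ w_q,    // packed weights, [k / 8, n]
    const half*     __restrict__ scale,  // per-column scale, [n] (hypothetical; real kernels use per-group scales)
    half*           __restrict__ y,      // output, [n]
    int k, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;

    float acc = 0.0f;
    float s = __half2float(scale[col]);

    for (int i = 0; i < k / 8; ++i)
    {
        uint32_t packed = w_q[i * n + col];          // eight 4-bit weights for this column

        #pragma unroll
        for (int j = 0; j < 8; ++j)
        {
            int q = (packed >> (4 * j)) & 0x0F;      // unpack one 4-bit value
            float w = (q - 8) * s;                   // dequantize on the fly, never writing FP16 weights to memory
            acc += __half2float(x[i * 8 + j]) * w;   // multiply-accumulate against the FP16 activation
        }
    }
    y[col] = __float2half(acc);
}
```

Above the threshold, by contrast, a separate kernel writes the full FP16 weight matrix out once and cuBLAS does the matmul, which matches the split described in the question.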