Our project can be considered a dynamic runtime kernel library that generates different executables for specific shapes and devices on the fly. BitBLAS enables Ladder to propagate layouts based on the compute expression and the target hardware instructions, avoiding bank conflicts and keeping global memory loads as coalesced as possible. However, our policy and schedule cannot achieve ideal performance when the shape is small, because parallelism is limited. This leads to an awkward situation: GEMV and GEMM use different instructions (for example, GEMV uses SIMT while MMA is applied to GEMM), so the layout propagated for GEMM may not be optimal for GEMV.
Currently, to preserve the optimal performance of GEMV, we disable weight propagation when the input M falls within a dynamic range. However, attention to the performance of contiguous decoding is growing, and in some projects, such as Flute, BitBLAS performs poorly when benchmarked with a preset dynamic input range.
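To make the current policy concrete, here is a minimal sketch of the decision described above. The function name, the threshold value, and the representation of the dynamic range as a list of candidate M values are all illustrative assumptions, not the actual BitBLAS internals:

```python
# Illustrative sketch of the current policy: weight propagation is
# disabled whenever the dynamic M range includes small, GEMV-like
# shapes, since the layout propagated for MMA-based GEMM can hurt
# the SIMT GEMV path.

GEMV_M_THRESHOLD = 16  # hypothetical cutoff below which SIMT GEMV wins


def should_propagate_weights(dynamic_m_range):
    """Return True only if every M in the range is large enough for GEMM."""
    return all(m > GEMV_M_THRESHOLD for m in dynamic_m_range)
```

Under this sketch, a range that includes decode-time shapes like M=1 keeps propagation off, which is exactly the behavior the hotfix would change.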
So it is time for us to decide: should we make a hotfix that enables weight propagation by default, to improve the performance of batched dequantize GEMV?
TODO Items:
Implement a benchmarking CI (see [Feature Request] CI/CD Request for our git pipeline #66) so that we can understand how the hotfix impacts a set of operators.
Survey the efficient implementations of Flute and Marlin, and check whether we can reproduce them with TVM.
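For the benchmarking item above, one possible shape of the harness is sketched below. `make_kernel` stands in for whatever entry point builds a compiled kernel with propagation enabled or disabled; it is a placeholder, not a real BitBLAS API:

```python
# Hypothetical benchmark harness sketch: time a compiled kernel with
# weight propagation on vs. off across the dynamic M values the CI cares
# about. `make_kernel(m, propagate)` is an assumed factory returning a
# zero-argument callable that launches the kernel.
import time


def bench(run_kernel, warmup=3, iters=20):
    """Return the mean latency of a kernel callable in milliseconds."""
    for _ in range(warmup):
        run_kernel()
    start = time.perf_counter()
    for _ in range(iters):
        run_kernel()
    return (time.perf_counter() - start) / iters * 1e3


def compare(make_kernel, m_values):
    """Collect (M, latency_propagate_on, latency_propagate_off) rows."""
    rows = []
    for m in m_values:
        t_on = bench(make_kernel(m, propagate=True))
        t_off = bench(make_kernel(m, propagate=False))
        rows.append((m, t_on, t_off))
    return rows
```

Running this over a decode-heavy range (e.g. M in {1, 16, 32, 64, 128}) would show directly where the propagated layout starts to pay off, which is the data the hotfix decision needs.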