Nanoflow overlaps decode/prefill/communication by limiting the number of SMs each kernel uses (in practice this is controlled by the grid size). The current nanoflow implementation modifies flashinfer kernels so that they can be launched with a specified grid size.
As flashinfer moves all kernel implementations to persistent kernels, we can support specifying the number of SMs on the flashinfer side. More specifically, we can add a `num_ctas` argument to our plan functions to specify the grid size, so that users can control it directly from Python.
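A minimal sketch of what this could look like from Python, assuming the existing `BatchDecodeWithPagedKVCacheWrapper.plan()` signature (`num_ctas` below is the proposed new argument, not an existing one):

```python
import torch
import flashinfer

batch_size, num_qo_heads, num_kv_heads, head_dim, page_size = 64, 32, 8, 128, 16

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
decode = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Trivial page table for illustration: one full page per request.
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda")
kv_indices = torch.arange(batch_size, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

decode.plan(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    num_ctas=32,  # proposed: cap the persistent kernel's grid at 32 CTAs (one per SM)
)
```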
The benefits of this feature include:

- Keeping nanoflow's development in step with the latest flashinfer features (JIT/FA3/customization/etc.).
- Making it possible to port nanoflow to PyTorch. This may sacrifice some performance, but overall it should be good for nanoflow's adoption.
- Making it possible to use nanoflow-style parallelism in other LLM serving frameworks such as vLLM/SGLang/MLC-LLM/etc.
We also need to support such arguments in the GEMM APIs by wrapping cutlass GEMM implementations; we leave this for future work.
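Putting it together, once `plan()` accepts `num_ctas`, nanoflow-style overlap could be expressed in plain PyTorch by partitioning SMs between a decode wrapper and a prefill wrapper and launching them on separate CUDA streams. This is only an illustrative sketch under the proposed API; the page-table layout, the 1/4-3/4 SM split, and all shapes are placeholders:

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
dec_bs, pre_bs, pre_qlen = 64, 4, 512   # decode batch, prefill batch, prefill query length
sms = torch.cuda.get_device_properties(0).multi_processor_count

def page_table(batch_size, pages_per_req, first_page):
    # One contiguous run of full pages per request (placeholder layout).
    indptr = torch.arange(0, (batch_size + 1) * pages_per_req, pages_per_req,
                          dtype=torch.int32, device="cuda")
    indices = first_page + torch.arange(batch_size * pages_per_req,
                                        dtype=torch.int32, device="cuda")
    last_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")
    return indptr, indices, last_len

dec_pages, pre_pages = 8, pre_qlen // page_size
dec_indptr, dec_indices, dec_last = page_table(dec_bs, dec_pages, 0)
pre_indptr, pre_indices, pre_last = page_table(pre_bs, pre_pages, dec_bs * dec_pages)
qo_indptr = torch.arange(0, (pre_bs + 1) * pre_qlen, pre_qlen,
                         dtype=torch.int32, device="cuda")

total_pages = dec_bs * dec_pages + pre_bs * pre_pages
paged_kv_cache = torch.randn(total_pages, 2, page_size, num_kv_heads, head_dim,
                             dtype=torch.float16, device="cuda")
q_decode = torch.randn(dec_bs, num_qo_heads, head_dim,
                       dtype=torch.float16, device="cuda")
q_prefill = torch.randn(pre_bs * pre_qlen, num_qo_heads, head_dim,
                        dtype=torch.float16, device="cuda")

decode = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
    torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda"), "NHD")
prefill = flashinfer.BatchPrefillWithPagedKVCacheWrapper(
    torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda"), "NHD")

# Partition the GPU: a quarter of the SMs for decode, the rest for prefill.
decode.plan(dec_indptr, dec_indices, dec_last, num_qo_heads, num_kv_heads,
            head_dim, page_size, num_ctas=sms // 4)          # proposed argument
prefill.plan(qo_indptr, pre_indptr, pre_indices, pre_last, num_qo_heads,
             num_kv_heads, head_dim, page_size, causal=True,
             num_ctas=sms - sms // 4)                        # proposed argument

# With persistent kernels each launch stays within its CTA budget,
# so the two streams can run concurrently on disjoint SMs.
s_dec, s_pre = torch.cuda.Stream(), torch.cuda.Stream()
with torch.cuda.stream(s_dec):
    out_dec = decode.run(q_decode, paged_kv_cache)
with torch.cuda.stream(s_pre):
    out_pre = prefill.run(q_prefill, paged_kv_cache)
torch.cuda.synchronize()
```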
cc @serendipity-zk @happierpig