Hi, I'm benchmarking flashinfer on an H100, running attention for the decoding stage.
I use q_head = kv_head = 40, which is the standard multi-head attention configuration for Llama 13B.
I tried both `use_tensor_cores=True` and `use_tensor_cores=False`, and I get nearly the same performance.
```
Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
    99.2         49025722         40  1225643.1  1222881.5   1219938   1231234       4342.4  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
     0.8           374625          1   374625.0   374625.0    374625    374625          0.0  void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosE…
```
```
Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
    99.2         49189726         40  1229743.2  1229681.5   1228833   1231042        495.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
     0.8           379936          1   379936.0   379936.0    379936    379936          0.0  void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::Mas…
     0.0             1600          1     1600.0     1600.0      1600      1600          0.0  void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…
```
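To quantify "nearly the same performance", here is a quick calculation using the per-kernel times copied from the two nsys tables above (only the flashinfer kernel rows matter; the dominant `distribution_elementwise_grid_stride_kernel` is tensor initialization, not attention):

```python
# Kernel times (ns) copied from the nsys stats tables above.
decode_ns = 374_625   # flashinfer::BatchDecodeWithPagedKVCacheKernel
prefill_ns = 379_936  # flashinfer::BatchPrefillWithPagedKVCacheKernel

# Relative difference between the two attention kernels.
diff_pct = (prefill_ns - decode_ns) / decode_ns * 100
print(f"prefill-based kernel is {diff_pct:.1f}% slower than the decode kernel")
```

So the two code paths differ by only about 1.4% on this workload.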
My questions are:

- Is this result reliable? Does `use_tensor_cores=True` invoke the prefill kernel?
- I checked tensor core usage for both kernels and found that both of them use tensor cores. Why is that?
```
void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosEncodingMode)0, (unsigned int)2, (unsigned int)4, (unsigned int)8, (unsigned int)16, (unsigned int)1, (unsigned int)8, (flashinfer::PageStorage)0, __half, __half, __half, int>(T10 *, T13 *, flashinfer::paged_kv_t<T9, T11, T13>, flashinfer::kv_partition_info_t<T13>, T12 *, float *, bool *, bool, int, float, float, float, float) (74, 40, 1)x(16, 1, 8), Context 1, Stream 7, Device 0, CC 9.0
  Section: Command line profiler metrics
  ---------------------------------------------------------------------- ----------- ------------
  Metric Name                                                            Metric Unit Metric Value
  ---------------------------------------------------------------------- ----------- ------------
  sm__inst_executed_pipe_tensor_op.sum                                   (!)         n/a
  sm__inst_executed_pipe_tensor_op_dmma.sum                              inst        0
  sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active %           0.11
  sm__inst_executed_pipe_tensor_op_hmma.sum                              inst        166688
  ---------------------------------------------------------------------- ----------- ------------
```
```
void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::MaskMode)0, (flashinfer::PosEncodingMode)0, (unsigned int)1, (unsigned int)8, (unsigned int)2, (unsigned int)1, (unsigned int)4, (flashinfer::PageStorage)0, __half, __half, float, __half, int>(T14 *, T14 *, T14 *, T10 *, flashinfer::paged_kv_t<T9, T11, T14>, T14 *, unsigned char *, T14 *, T14 *, T14 *, T13 *, float *, bool *, T14 *, bool, flashinfer::uint_fastdiv, int, float, float, float, float) (74, 1, 40)x(32, 1, 4), Context 1, Stream 7, Device 0, CC 9.0
  Section: Command line profiler metrics
  ---------------------------------------------------------------------- ----------- ------------
  Metric Name                                                            Metric Unit Metric Value
  ---------------------------------------------------------------------- ----------- ------------
  sm__inst_executed_pipe_tensor_op.sum                                   (!)         n/a
  sm__inst_executed_pipe_tensor_op_dmma.sum                              inst        0
  sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active %           3.60
  sm__inst_executed_pipe_tensor_op_hmma.sum                              inst        4899840
  ---------------------------------------------------------------------- ----------- ------------
```
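For reference, the ncu counters above can be compared directly. A quick calculation (numbers copied from the two metric tables) shows that while both kernels report nonzero HMMA (half-precision tensor core) activity, the prefill-based kernel issues roughly 29x more HMMA instructions than the decode kernel:

```python
# sm__inst_executed_pipe_tensor_op_hmma.sum values from the ncu output above.
decode_hmma = 166_688     # BatchDecodeWithPagedKVCacheKernel  (0.11% of peak)
prefill_hmma = 4_899_840  # BatchPrefillWithPagedKVCacheKernel (3.60% of peak)

# How much more heavily the prefill kernel uses the HMMA pipe.
ratio = prefill_hmma / decode_hmma
print(f"prefill kernel issues {ratio:.1f}x more HMMA instructions")
```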