
Question about use_tensor_cores = True or False #520

Open
@sleepwalker2017

Description


Hi, I'm benchmarking flashinfer on an H100, running attention for the decoding stage.

I use q_head = kv_head = 40, which is the standard multi-head attention configuration for Llama 13B.

I tried use_tensor_cores = True and use_tensor_cores = False, and I get nearly the same performance.

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     99.2         49025722         40  1225643.1  1222881.5   1219938   1231234       4342.4  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.8           374625          1   374625.0   374625.0    374625    374625          0.0  void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosE…
 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     99.2         49189726         40  1229743.2  1229681.5   1228833   1231042        495.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.8           379936          1   379936.0   379936.0    379936    379936          0.0  void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::Mas…
      0.0             1600          1     1600.0     1600.0      1600      1600          0.0  void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…
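To quantify "nearly the same": a quick check on the single-instance attention kernel times from the two nsys tables above shows the gap is well under 2%.

```python
# Kernel times taken from the two nsys tables above (the Instances == 1 rows).
decode_ns = 374_625   # flashinfer::BatchDecodeWithPagedKVCacheKernel
prefill_ns = 379_936  # flashinfer::BatchPrefillWithPagedKVCacheKernel

# Relative difference between the two attention kernels.
rel_diff = (prefill_ns - decode_ns) / decode_ns
print(f"{rel_diff:.1%}")  # prints 1.4%
```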

My questions are:

  1. Is this result reliable? If use_tensor_cores=True, does it invoke the prefill kernel?

  2. I profiled tensor-core usage for both kernels, but found that both of them use tensor cores. Why is that?

  void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosEncodingMode)0, (unsigned int)2, (unsigned int)4, (unsigned int)8, (unsigned int)16, (unsigned int)1, (unsigned int)8, (flashinfer::PageStorage)0, __half, __half, __half, int>(T10 *, T13 *, flashinfer::paged_kv_t<T9, T11, T13>, flashinfer::kv_partition_info_t<T13>, T12 *, float *, bool *, bool, int, float, float, float, float) (74, 40, 1)x(16, 1, 8), Context 1, Stream 7, Device 0, CC 9.0
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- ----------- ------------
    Metric Name                                                            Metric Unit Metric Value
    ---------------------------------------------------------------------- ----------- ------------
    sm__inst_executed_pipe_tensor_op.sum                                                    (!) n/a
    sm__inst_executed_pipe_tensor_op_dmma.sum                                     inst            0
    sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active           %         0.11
    sm__inst_executed_pipe_tensor_op_hmma.sum                                     inst       166688
    ---------------------------------------------------------------------- ----------- ------------
  void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::MaskMode)0, (flashinfer::PosEncodingMode)0, (unsigned int)1, (unsigned int)8, (unsigned int)2, (unsigned int)1, (unsigned int)4, (flashinfer::PageStorage)0, __half, __half, float, __half, int>(T14 *, T14 *, T14 *, T10 *, flashinfer::paged_kv_t<T9, T11, T14>, T14 *, unsigned char *, T14 *, T14 *, T14 *, T13 *, float *, bool *, T14 *, bool, flashinfer::uint_fastdiv, int, float, float, float, float) (74, 1, 40)x(32, 1, 4), Context 1, Stream 7, Device 0, CC 9.0
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- ----------- ------------
    Metric Name                                                            Metric Unit Metric Value
    ---------------------------------------------------------------------- ----------- ------------
    sm__inst_executed_pipe_tensor_op.sum                                                    (!) n/a
    sm__inst_executed_pipe_tensor_op_dmma.sum                                     inst            0
    sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active           %         3.60
    sm__inst_executed_pipe_tensor_op_hmma.sum                                     inst      4899840
    ---------------------------------------------------------------------- ----------- ------------
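For what it's worth, a back-of-envelope roofline estimate (my own numbers, not from the profiler; the H100 SXM peak figures below are approximate) suggests single-token decode attention at this shape is heavily memory-bound, which would be consistent with both runs landing at nearly the same time regardless of which math pipeline is used:

```python
# Approximate H100 SXM peak figures (assumptions, not measured values).
PEAK_FLOPS = 989e12  # dense fp16 tensor-core FLOP/s
PEAK_BW = 3.35e12    # HBM3 bytes/s

def decode_arithmetic_intensity(seq_len: int, head_dim: int) -> float:
    """FLOPs per byte of KV-cache traffic for one decode query, per head."""
    flops = 4 * seq_len * head_dim             # 2*L*d for Q@K^T + 2*L*d for P@V
    bytes_moved = 2 * seq_len * head_dim * 2   # K and V reads, fp16 (2 bytes/elem)
    return flops / bytes_moved

ai = decode_arithmetic_intensity(seq_len=2048, head_dim=128)
ridge = PEAK_FLOPS / PEAK_BW  # intensity where compute and bandwidth balance
print(ai, round(ridge))       # ~1.0 vs ~295: far into the memory-bound regime
```

At roughly 1 flop per byte against a ridge point near 300, the kernel's runtime is set by KV-cache bandwidth, so the tensor-core (HMMA) pipeline sitting almost idle (0.11% and 3.60% of peak above) would not change the wall-clock time much.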
