Port oss f16_fast_gemv into fbcode #3610

Open
wants to merge 1 commit into base: main

Conversation

YUNQIUGUO

Summary:
This diff includes:

  1. Port the OSS FastGEMV fp16 kernel into fbcode and expose it to Python as a first step: torch.ops.fbgemm.f16_fast_gemv
    https://github.com/wangsiping97/FastGEMV/blob/1fdff6f74aade033c02727a419afd6a4b4bfbc3f/fast_gemv.cu#L14
  2. Add fp16_oss_fast_gemv to the quantize ops benchmark script
  3. Add two simple tests for the custom op torch.ops.fbgemm.f16_fast_gemv, verifying (see the sketch below):
    • torch.compile() compatibility
    • correctness
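A minimal sketch of how those two checks might look. The exact op signature (an [M, K] fp16 activation against an [N, K] fp16 weight, matching the GEMV shapes benchmarked later in this thread) and the tolerances are assumptions, not this PR's actual test code:

```python
import torch

# Hypothetical shapes drawn from the benchmark tables below (M=1 GEMV).
M, N, K = 1, 1280, 8192
x = torch.randn(M, K, dtype=torch.float16, device="cuda")
w = torch.randn(N, K, dtype=torch.float16, device="cuda")

# Correctness: compare the custom op against a plain fp16 matmul reference.
# Tolerances are a guess; fp16 accumulation order differs between kernels.
out = torch.ops.fbgemm.f16_fast_gemv(x, w)
ref = x @ w.T
torch.testing.assert_close(out, ref, atol=1e-2, rtol=1e-2)

# torch.compile()-ability: the op should compile and match eager output.
compiled = torch.compile(lambda a, b: torch.ops.fbgemm.f16_fast_gemv(a, b))
torch.testing.assert_close(compiled(x, w), out)
```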

Next step:
Add fp8 mixed-precision support to the fast GEMV kernel, which is the end goal.
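For context on what the fp8 mixed-precision path would need as input, a rowwise fp8 weight quantization sketch using PyTorch's float8 dtype. The helper name and scale layout are hypothetical, not this PR's API:

```python
import torch

def quantize_weight_rowwise_fp8(w: torch.Tensor):
    """Hypothetical helper: per-row absmax scales, e4m3 storage.

    An fp8 GEMV would multiply activations against `wq` and rescale
    each output row by `scale`.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = w.abs().amax(dim=1, keepdim=True).float() / fp8_max
    scale = scale.clamp(min=1e-12)  # guard against all-zero rows
    wq = (w.float() / scale).to(torch.float8_e4m3fn)
    return wq, scale.squeeze(1)

w = torch.randn(1280, 8192, dtype=torch.float16)
wq, scale = quantize_weight_rowwise_fp8(w)
# Dequantized check: rowwise fp8 should closely approximate the fp16 weight.
w_hat = wq.float() * scale.unsqueeze(1)
print((w_hat - w.float()).abs().max())
```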

Differential Revision: D68470488

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D68470488


netlify bot commented Jan 23, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 4851e7a
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/679d80f0e4e69a0008fa51d0
😎 Deploy Preview: https://deploy-preview-3610--pytorch-fbgemm-docs.netlify.app

YUNQIUGUO added a commit to YUNQIUGUO/FBGEMM that referenced this pull request Jan 24, 2025

YUNQIUGUO added a commit to YUNQIUGUO/FBGEMM that referenced this pull request Jan 24, 2025

YUNQIUGUO added a commit to YUNQIUGUO/FBGEMM that referenced this pull request Jan 27, 2025

YUNQIUGUO added a commit to YUNQIUGUO/FBGEMM that referenced this pull request Jan 28, 2025
Perf numbers: P1720649201, comparing `f16_baseline`, `fp16_oss_fast_gemv`, `cuda_lite`, `marlin_bf16i4`, `machete_bf16i4`.

YUNQIUGUO added a commit to YUNQIUGUO/FBGEMM that referenced this pull request Jan 28, 2025

YUNQIUGUO added a commit to YUNQIUGUO/FBGEMM that referenced this pull request Jan 30, 2025
Perf numbers: P1722119058, comparing `f16_baseline`, `fp16_oss_fast_gemv`, `cuda_lite`, `marlin_bf16i4`, `machete_bf16i4`.

Heuristic sweep results for the four problem sizes we care about: P1722043272

YUNQIUGUO added a commit to YUNQIUGUO/FBGEMM that referenced this pull request Jan 30, 2025
### Benchmark Results

| **M** | **N** | **K** | **Method** | **Elapsed Time (ms)** | **TFLOPS** | **GB/s** |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1280 | 8192 | bf16_baseline | 0.024 | 0.860 | 861.042 |
| 1 | 1280 | 8192 | fp16_oss_fast_gemv | 0.019 | 1.126 | 1127.391 |
| 1 | 1280 | 8192 | cuda_lite | 0.015 | 1.357 | 679.032 |
| 1 | 1280 | 8192 | marlin_bf16i4 | 0.027 | 0.768 | 192.612 |
| 1 | 1280 | 8192 | machete_bf16i4 | 0.026 | 0.810 | 203.219 |
| 1 | 8192 | 1024 | bf16_baseline | 0.018 | 0.952 | 953.176 |
| 1 | 8192 | 1024 | fp16_oss_fast_gemv | 0.010 | 1.763 | 1765.033 |
| 1 | 8192 | 1024 | cuda_lite | 0.014 | 1.198 | 600.054 |
| 1 | 8192 | 1024 | marlin_bf16i4 | 0.015 | 1.144 | 287.150 |
| 1 | 8192 | 1024 | machete_bf16i4 | 0.014 | 1.187 | 298.096 |
| 1 | 7168 | 8192 | bf16_baseline | 0.073 | 1.609 | 1608.983 |
| 1 | 7168 | 8192 | fp16_oss_fast_gemv | 0.069 | 1.697 | 1697.308 |
| 1 | 7168 | 8192 | cuda_lite | 0.044 | 2.679 | 1340.093 |
| 1 | 7168 | 8192 | marlin_bf16i4 | 0.033 | 3.590 | 898.436 |
| 1 | 7168 | 8192 | machete_bf16i4 | 0.039 | 3.017 | 755.147 |
| 1 | 8192 | 3584 | bf16_baseline | 0.045 | 1.312 | 1312.239 |
| 1 | 8192 | 3584 | fp16_oss_fast_gemv | 0.026 | 2.268 | 1134.843 |
| 1 | 8192 | 3584 | cuda_lite | 0.026 | 2.271 | 1136.151 |
| 1 | 8192 | 3584 | marlin_bf16i4 | 0.021 | 2.808 | 703.164 |
| 1 | 8192 | 3584 | machete_bf16i4 | 0.024 | 2.460 | 615.990 |
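As a sanity check on how the table's last two columns are likely derived (an assumption about the benchmark script, not its actual code): an M=1 GEMV performs 2·M·N·K FLOPs, and its memory traffic is dominated by the weight matrix:

```python
def gemv_metrics(m, n, k, elapsed_ms, bytes_per_elem=2):
    """Back out TFLOPS and GB/s from a table row (fp16/bf16 => 2 bytes/elem)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + n * k + m * n)
    sec = elapsed_ms * 1e-3
    return flops / sec / 1e12, bytes_moved / sec / 1e9

# Roughly reproduces the fp16_oss_fast_gemv row for (1, 1280, 8192):
# ~1.10 TFLOPS and ~1105 GB/s vs. the table's 1.126 / 1127 (the gap is
# rounding in the 0.019 ms elapsed time).
print(gemv_metrics(1, 1280, 8192, 0.019))
```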

YUNQIUGUO added a commit to YUNQIUGUO/FBGEMM that referenced this pull request Jan 31, 2025
Updated perf numbers, with `cuda_lite` relabeled `cuda_lite_fp8` in the table above. Note that the precision of the `fast_gemv` kernel does not yet match `cuda_lite_fp8`, so fp8 support is needed for a fairer comparison.

Heuristic sweep results for the four problem sizes we care about: P1722806148

YUNQIUGUO added a commit to YUNQIUGUO/FBGEMM that referenced this pull request Jan 31, 2025
YUNQIUGUO added a commit to YUNQIUGUO/FBGEMM that referenced this pull request Jan 31, 2025
YUNQIUGUO added a commit to YUNQIUGUO/FBGEMM that referenced this pull request Feb 1, 2025
Summary:
X-link: facebookresearch/FBGEMM#688

This diff includes:
1. Port the OSS FastGEMV `bf16` kernel into fbcode and expose it to Python as a first step: `torch.ops.fbgemm.bf16_fast_gemv`
https://github.com/wangsiping97/FastGEMV/blob/1fdff6f74aade033c02727a419afd6a4b4bfbc3f/fast_gemv.cu#L14
2. Add `bf16_oss_fast_gemv` to the quantize ops benchmark script
3. Add two simple tests for the custom op `torch.ops.fbgemm.f16_fast_gemv`, verifying:
     - `torch.compile()` compatibility
     - correctness

Perf numbers compared across `bf16_baseline`, `bf16_oss_fast_gemv`, `cuda_lite`, `marlin_bf16i4`, `machete_bf16i4`:


### Benchmark Results on H100


| **M** | **N** | **K** | **Method** | **Elapsed Time (ms)** | **TFLOPS** | **GB/s** |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1280 | 8192 | bf16_baseline | 0.024 | 0.860 | 861.042 |
| 1 | 1280 | 8192 | bf16_oss_fast_gemv | 0.019 | 1.126 | 1127.391 |
| 1 | 1280 | 8192 | cuda_lite_fp8 | 0.015 | 1.357 | 679.032 |
| 1 | 1280 | 8192 | marlin_bf16i4 | 0.027 | 0.768 | 192.612 |
| 1 | 1280 | 8192 | machete_bf16i4 | 0.026 | 0.810 | 203.219 |
| 1 | 8192 | 1024 | bf16_baseline | 0.018 | 0.952 | 953.176 |
| 1 | 8192 | 1024 | bf16_oss_fast_gemv | 0.015 | 1.100 | 1100.900 |
| 1 | 8192 | 1024 | cuda_lite_fp8 | 0.014 | 1.198 | 600.054 |
| 1 | 8192 | 1024 | marlin_bf16i4 | 0.015 | 1.144 | 287.150 |
| 1 | 8192 | 1024 | machete_bf16i4 | 0.014 | 1.187 | 298.096 |
| 1 | 7168 | 8192 | bf16_baseline | 0.073 | 1.609 | 1608.983 |
| 1 | 7168 | 8192 | bf16_oss_fast_gemv | 0.069 | 1.697 | 1697.308 |
| 1 | 7168 | 8192 | cuda_lite_fp8 | 0.044 | 2.679 | 1340.093 |
| 1 | 7168 | 8192 | marlin_bf16i4 | 0.033 | 3.590 | 898.436 |
| 1 | 7168 | 8192 | machete_bf16i4 | 0.039 | 3.017 | 755.147 |
| 1 | 8192 | 3584 | bf16_baseline | 0.045 | 1.312 | 1312.239 |
| 1 | 8192 | 3584 | bf16_oss_fast_gemv | 0.041 | 1.427 | 1427.166 |
| 1 | 8192 | 3584 | cuda_lite_fp8 | 0.026 | 2.271 | 1136.151 |
| 1 | 8192 | 3584 | marlin_bf16i4 | 0.021 | 2.808 | 703.164 |
| 1 | 8192 | 3584 | machete_bf16i4 | 0.024 | 2.460 | 615.990 |

Note that the precision of the `fast_gemv` kernel does not yet match `cuda_lite`, so fp8 support is needed for a fairer comparison.

The `no_cuda_graph` flag was enabled when running `quantize_bench` (see the timing sketch below).
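For reference, elapsed times like those above are typically collected with CUDA events when CUDA graphs are disabled; a minimal sketch of that methodology (an assumption about what `quantize_bench` does under `no_cuda_graph`, not its actual code):

```python
import torch

def time_op_ms(fn, warmup=10, iters=100):
    """CUDA-event timing without CUDA graphs: warm up, then average."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait so elapsed_time() is valid
    return start.elapsed_time(end) / iters  # milliseconds per call
```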

Heuristic sweep results for the four problem sizes we care about:
P1722806148


**Next step:**
Add fp8 mixed-precision support to the fast GEMV kernel, which is the end goal.

Differential Revision: D68470488
