
Faiss GPU: bfloat16 brute-force kNN support #4014

Closed
wants to merge 1 commit

Conversation

mdouze
Contributor

@mdouze mdouze commented Nov 5, 2024

Summary:
This diff adds support for bfloat16 vector/query data types with the GPU brute-force k-nearest neighbor function (`bfKnn`).
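
For illustration, here is a minimal sketch of how a caller could exercise the new path, assuming a `DistanceDataType::BF16` enumerator is added next to `F32`/`F16` and the existing `GpuDistanceParams` field names are unchanged (the helper function and its arguments are hypothetical):

```cpp
// Sketch only: assumes DistanceDataType::BF16 exists and that `db` / `queries`
// point to row-major __nv_bfloat16 data (host or device), as in the fp16 path.
#include <faiss/gpu/GpuDistance.h>
#include <faiss/gpu/StandardGpuResources.h>
#include <cuda_bf16.h>

void bf16Search(
        const __nv_bfloat16* db,      // numVectors x dims
        const __nv_bfloat16* queries, // numQueries x dims
        int dims, int numVectors, int numQueries, int k,
        float* outDistances,          // numQueries x k, always float32
        faiss::idx_t* outIndices) {   // numQueries x k
    faiss::gpu::StandardGpuResources res;

    faiss::gpu::GpuDistanceParams args;
    args.metric = faiss::METRIC_L2;
    args.k = k;
    args.dims = dims;
    args.vectors = db;
    args.vectorType = faiss::gpu::DistanceDataType::BF16;
    args.numVectors = numVectors;
    args.queries = queries;
    args.queryType = faiss::gpu::DistanceDataType::BF16;
    args.numQueries = numQueries;
    args.outDistances = outDistances; // results stay in float32 by design
    args.outIndices = outIndices;

    faiss::gpu::bfKnn(&res, args);
}
```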

The change is largely just plumbing the new data type through the template hierarchy (so distances can be computed in bfloat16).

Of note, by design, all final distance results are produced in float32 regardless of the input data type (float32, float16, bfloat16). This is because the true nearest neighbors in many data sets can differ by only ~1000 float32 ULPs in terms of distance, which could result in false equivalences. This seems to be one area where lossy compression/quantization throughout does not work as well (and is also why `CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION` is set in `StandardGpuResources.cpp`). However, given that there is native bf16 x bf16 = fp32 tensor core support on Ampere+ architectures, the matrix multiplication itself should use them.
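
As background on the cuBLAS flag mentioned above, the call that disallows reduced-precision accumulation looks roughly like this (a sketch, not necessarily the exact code in `StandardGpuResources.cpp`):

```cpp
// Sketch: forbid fp16/bf16 accumulation inside cuBLAS GEMMs so reductions are
// done in float32; tensor cores remain usable for the bf16 x bf16 -> fp32
// multiply itself.
#include <cublas_v2.h>

void configureHandle(cublasHandle_t handle) {
    cublasSetMathMode(
            handle,
            cublasMath_t(
                    CUBLAS_DEFAULT_MATH |
                    CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION));
}
```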

WARNING: The one thing this diff does not yet handle properly is header inclusion / compilation for GPUs older than Ampere. This will need to be fixed before landing, so that compiling with an older CUDA SDK or compiling for the Volta architecture will simply error out at runtime with a lack-of-support message, instead of failing to compile.
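
One possible shape for that fix, purely as a sketch (the `FAISS_USE_BF16` macro is hypothetical; `FAISS_THROW_MSG` is Faiss's existing error macro): compile the bf16 paths only when the toolkit provides `cuda_bf16.h`, and raise a clear runtime error otherwise.

```cpp
// Sketch: gate bf16 code on the CUDA toolkit version at compile time and fail
// cleanly at runtime when the build (or the device) cannot support it.
#include <cuda_runtime.h>
#include <faiss/impl/FaissAssert.h>

#if defined(CUDART_VERSION) && CUDART_VERSION >= 11000
#include <cuda_bf16.h>
#define FAISS_USE_BF16 1
#else
#define FAISS_USE_BF16 0
#endif

inline void assertBf16Supported() {
#if !FAISS_USE_BF16
    FAISS_THROW_MSG(
            "bfloat16 brute-force kNN requires CUDA 11+ "
            "(and an Ampere-or-newer GPU for tensor cores)");
#endif
}
```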

Differential Revision: D65459723

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D65459723

@mdouze
Contributor Author

mdouze commented Nov 5, 2024

Related to #3862

mdouze pushed a commit to mdouze/faiss that referenced this pull request Nov 5, 2024
@iotamudelta
Contributor

For ROCm support, `hip/hip_bf16.h` needs to be included. This'll provide the hipified `__hip_bfloat16` data type and operations as per `amd_hip_bf16.h`, as well as the packed `__hip_bfloat162`. Hope this helps.
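
For reference, a minimal sketch of the suggested include on the HIP side (the conversion helper is illustrative; the intrinsic name follows `amd_hip_bf16.h`'s CUDA-style naming):

```cpp
// Sketch: hip/hip_bf16.h provides __hip_bfloat16 (scalar) and __hip_bfloat162
// (packed pair), mirroring cuda_bf16.h's __nv_bfloat16 / __nv_bfloat162.
#include <hip/hip_bf16.h>

__device__ inline float bf16ToFloat(__hip_bfloat16 v) {
    return __bfloat162float(v); // conversion from amd_hip_bf16.h
}
```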

wickedfoo pushed a commit to wickedfoo/faiss that referenced this pull request Nov 6, 2024
@gtwang01 gtwang01 added the GPU label Nov 7, 2024
wickedfoo pushed a commit to wickedfoo/faiss that referenced this pull request Nov 19, 2024
@gtwang01 gtwang01 closed this Nov 19, 2024
facebook-github-bot pushed a commit that referenced this pull request Nov 20, 2024
Summary:
Pull Request resolved: #4018

Pull Request resolved: #4014

This diff adds support for bfloat16 vector/query data types with the GPU brute-force k-nearest neighbor function (`bfKnn`).

The change is largely just plumbing the new data type through the template hierarchy (so distances can be computed in bfloat16).

Of note, by design, all final distance results are produced in float32 regardless of the input data type (float32, float16, bfloat16). This is because the true nearest neighbors in many data sets can differ by only ~1000 float32 ULPs in terms of distance, which could result in false equivalences. This seems to be one area where lossy compression/quantization throughout does not work as well (and is also why `CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION` is set in `StandardGpuResources.cpp`). However, given that there is native bf16 x bf16 = fp32 tensor core support on Ampere+ architectures, the matrix multiplication itself should use them.

As bfloat16 support is quite lacking on AMD/ROCm (see [here](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/tables/CUDA_Device_API_supported_by_HIP.html), very few bf16 functions implemented), bf16 functionality is completely disabled / not compiled for AMD ROCm.
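
As a sketch of what "disabled / not compiled" could look like, assuming the build defines `USE_AMD_ROCM` for the hipified sources (the exact macro name is an assumption):

```cpp
// Sketch: compile the bf16 specializations only in the CUDA build and reject
// bf16 inputs cleanly under ROCm.
#include <faiss/impl/FaissAssert.h>

#ifndef USE_AMD_ROCM
#include <cuda_bf16.h>
// ... bf16 distance kernels / template instantiations go here ...
#else
inline void rejectBf16() {
    FAISS_THROW_MSG("bfloat16 is not supported in the ROCm build of Faiss GPU");
}
#endif
```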

Reviewed By: mdouze

Differential Revision: D65459723

fbshipit-source-id: 8a6aec843f7e37c205d95f2485442a26c402a3b0