[DEV] Transform Codebase from Azure to GitHub (#14)
* update codeql

* fix uint32 zero issue

* initial transparency.

* enhance transparency.

* rename transparency

* dependabot fix

* update transparency.

* update plugin

* remove redundant transparency

* dsl benchmark scripts

* update submodule.

* remove redundant code.

* remove transparency

* fix propagate map issue

* implement in register dequantize config

* optimize target

* fix tag.

* fix some issues on ampere game device

* finetune with data distribution.

* fill matmul benchmarking scripts

* refactor use_async_copy to bool value

* support af format

* format fix

* support propagate input transform for dequantization.

* update requirements

* update requirements.txt

* update af4 related tests.

* clean test

* naive support for dynamic zeros

* move to bitdistiller

* implement lop3 with zeros cpp test

* implement fast decoding with zeros

* update zero generation support.

* Bump transformers from 4.29.2 to 4.36.0

Bumps [transformers](https://github.com/huggingface/transformers) from 4.29.2 to 4.36.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](huggingface/transformers@v4.29.2...v4.36.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump pillow from 9.4.0 to 10.2.0

Bumps [pillow](https://github.com/python-pillow/Pillow) from 9.4.0 to 10.2.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](python-pillow/Pillow@9.4.0...10.2.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump tornado from 6.2 to 6.3.3

Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.2 to 6.3.3.
- [Changelog](https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst)
- [Commits](tornadoweb/tornado@v6.2.0...v6.3.3)

---
updated-dependencies:
- dependency-name: tornado
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump scipy from 1.5.3 to 1.11.1

Bumps [scipy](https://github.com/scipy/scipy) from 1.5.3 to 1.11.1.
- [Release notes](https://github.com/scipy/scipy/releases)
- [Commits](scipy/scipy@v1.5.3...v1.11.1)

---
updated-dependencies:
- dependency-name: scipy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump jinja2 from 3.1.2 to 3.1.3

Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](pallets/jinja@3.1.2...3.1.3)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump pygments from 2.2.0 to 2.15.0

Bumps [pygments](https://github.com/pygments/pygments) from 2.2.0 to 2.15.0.
- [Release notes](https://github.com/pygments/pygments/releases)
- [Changelog](https://github.com/pygments/pygments/blob/master/CHANGES)
- [Commits](pygments/pygments@2.2.0...2.15.0)

---
updated-dependencies:
- dependency-name: pygments
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump pygments from 2.13.0 to 2.15.0

Bumps [pygments](https://github.com/pygments/pygments) from 2.13.0 to 2.15.0.
- [Release notes](https://github.com/pygments/pygments/releases)
- [Changelog](https://github.com/pygments/pygments/blob/master/CHANGES)
- [Commits](pygments/pygments@2.13.0...2.15.0)

---
updated-dependencies:
- dependency-name: pygments
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* update requirements and matmul.

* support fast decode for int8 related items

* improve pass context

* update benchmark related figures.

* update benchmark readme

* reorganize readme

* refactor readme

* update benchmark readme

* refactor quant linear for bisect

* update tvm submodule

* fix blockIdx related

* update bitdistiller related.

* update zero type related test

* implement zero types support

* implement zero types support

* fix lop3 permutate issue.

* fix weight executor bug.

* improve typing

* resolve performance related items

* add implementation for dequantization with dynamic symbolic

* fix ladder transform related issues.

* improve ladder permutation for dequantization

* enhance dynamic symbolic for matmul_impl

* improve support for dynamic symbolic

* update tvm dependency

* implement operator cache.

* refactor print to logging

* append setup.py and remove tvm pythonpath dependency.

* update ignore

* improve installation scripts

* update scaling benchmark of 1bit

* int8xint1 lop3 support.

* replace with to_torch_func

* license related fix

* update contributing.md

* autogptq support.

* refactor docs

* refactor

* refactor docs

* typo fix

* implement disk cache

* refactor codegen to get_source

* support get weight shape.

* Update dependabot.yml

* Update dependabot.yml

* Update dependabot.yml

* Update dependabot.yml

* Update dependabot.yml

* Update requirements.txt

* Update requirements.txt

* Update requirements.txt

* refactor propagate into transform kind

* Update dependabot.yml

* implement scale and zero layout propagation

* typo fix

* refactor codes

* fix performance issue of dequantize propagate

* refactor print

* fix gemv scale bugs

* refactor ops configs

* improve tensor_adapter

* implement trick wrapper for integration

* code refactor

* SUPPORT.md commit

* spell check

* improve for linting

* overall lint improvements

* Add copyright and license information

* improve contributing

* Fix PYTHONPATH export in installation script and update BitBLAS package

* Update benchmark section in README.md

* Update performance benchmarks and integration details

* Fix typo in README.md

* Refactor index map logging in matmul_analysis.py

* Add .ruff_cache to .gitignore

* Add _tir_u32_to_f4_to_f16 function to quantization module

* Update performance benchmark images

* Update benchmark configurations

* Update benchmark information in README.md

* Refactor code for improved performance and readability

* convolution impl support

* Refactor convolution2d_impl.py and test_auto_normalized_tensorcore.py

* Fix code formatting and remove unnecessary code

* Update TensorCore GEMM Performance Comparison

* Update TensorCore GEMM performance comparison on A100 and RTX4090

* Refactor propagate_inputs method in TensorCorePolicy

* Fix BitBLAS import and remove debug print statements

* Add end-to-end integration with Quantize Inference Kernel for AutoGPTQ and vLLM

* Fix import order and handle exception in benchmark scripts

* Update TVM subproject commit

* Update TileDevice class names in bitblas package

* Update imports in roller module

* Update images

* Update images

* Update end2end_llama_13b_vllm.png

* Update trademark and acknowledgement section

* Update benchmark images for consistent GEMM operations

* Add test case for decoding UInt4 to Float16 with scaling and zeros quantized

* Remove benchmarking code for int4 on a specific target

* Update image files and add new functions for quantization and rasterization

* fix rescale and original lop3.

* Add integration example of FasterTransformers with BitBLAS

* Update integration example of FasterTransformer with BitBLAS

* Update requirements-dev.txt and requirements.txt

* Add LLVM download and extraction functionality

* Update FasterTransformer.gif

* Update BitBLAS version and requirements

* Update BitBLAS import paths and add support for installing and developing TVM

* Add GPU intrinsics module for BitBLAS

* Update requirements-dev.txt and requirements.txt

* Refactor import paths in BitBLAS GPU modules

* Update installation guide in Installation.md

* Refactor MatmulConfig class in matmul.py for improved readability and maintainability

* Refactor MatmulConfig class in matmul.py for improved readability and maintainability

* Refactor MatmulConfig class in matmul.py for improved readability and maintainability

* Update installation guide and QuickStart link in README.md

* Update installation guide and QuickStart link in README.md

* Append Default Schedule Fallback

* Refactor requirements-dev.txt and fix newline issue in arch_base.py

* Fix typo in check_mit_license.sh

* improve the target detection.

* Improve target detection and fix typos in code

* Fix auto-inline spacing issue in MatmulTensorizationMMAWithDequantizeInfo class

* Improve target detection and fix typos in code

* transform to submit

* Add support for weight_dtype transformation in MatmulWeightOnlyDequantizeConfig

* Update zeros_type to zeros_mode in code blocks

* update README

* update README

* Fix import errors and update paths in code

* Update variable names in test_bitblas_linear.py and __init__.py

* Update imports and add new function in quantization and cache modules

* Update README with support matrix table

* Update support matrix table and benchmark configurations

* Update support matrix table and benchmark configurations

* Update support matrix table and benchmark configurations

* Update support matrix table and benchmark configurations

* Update support matrix table and benchmark configurations

* Update import statements and add new functions in quantization and cache modules

* Fix default dynamic range for M in MatmulConfig

* Update support matrix table with new tested platforms and Out_dtype column

* Refactor code for mixed-precision matrix multiplication and update support matrix table

* Refactor code for mixed-precision matrix multiplication and update support matrix table

* Update MatmulConfig initialization in QuickStart.md

* Update support matrix table with new tested platforms and INT32/FP16/INT8 support

* Refactor code for mixed-precision matrix multiplication and update support matrix table

* Update link to code implementation in QuickStart.md

* Disable tuning for initial bitblas operator creation

* Update linear transformation description in PythonAPI.md

* Update MatmulConfig in PythonAPI.md

* convert af format to nf

* Enable hardware-aware tuning for bitblas operators

* Refactor code for mixed-precision matrix multiplication and update support matrix table

* Update support matrix table with new tested platforms and INT32/FP16/INT8 support

* Update OperatorConfig.md with matrix multiplication configuration details

* code refactor

* Fix capitalization in QuickStart.md

* update ReadME

* Refactor setup.py to remove unnecessary code and improve readability

* refactor infeatures to infeatures

* update README.md

* Fix incorrect data type mapping in general_matmul.py

* update doc

* Refactor variable names in bitblas_linear.py and bitblas_quant_linear.py

* uncomments some case

* Add BITBLAS_DATABASE_PATH constant to OperatorCache and update load_global_ops_cache function

* Refactor variable names in bitblas_linear.py and bitblas_quant_linear.py

* Refactor variable names in bitblas_linear.py and bitblas_quant_linear.py

* Update dependencies in requirements-dev.txt and requirements.txt

* Refactor variable names in bitblas_linear.py and bitblas_quant_linear.py

* Fix BITBLAS_DATABASE_PATH constant assignment in OperatorCache

* Refactor variable names in bitblas_linear.py and bitblas_quant_linear.py

* Refactor variable names in bitblas_linear.py and bitblas_quant_linear.py

* update install

* Refactor variable names in setup.py and build_tvm function

* append linear benchmark scripts

* simple bug fix

* Update BitBLAS installation instructions for Ubuntu 20.04

* Refactor variable names and add output print statements for debugging

* Refactor variable names and update dependencies

* Update BitBLAS installation instructions for Ubuntu 20.04 and add note about Linux support

* Refactor logging handler and set log level in BitBLAS module

* Bump version to 0.0.1

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Lingxiao Ma <[email protected]>
3 people committed Apr 15, 2024
1 parent de49655 commit eed0ea2
Showing 133 changed files with 9,786 additions and 4,713 deletions.
8 changes: 7 additions & 1 deletion .gitignore
@@ -69,4 +69,10 @@ models/frozenmodels/
.pytest_cache

# .hypothesis
.hypothesis
.hypothesis

# .ruff_cache
.ruff_cache

# .bitblas_database
.bitblas_database
3 changes: 3 additions & 0 deletions 3rdparty/.gitignore
@@ -0,0 +1,3 @@
clang*

llvm*
2 changes: 1 addition & 1 deletion 3rdparty/tvm
Submodule tvm updated 35 files
+34 −0 include/tvm/runtime/c_runtime_api.h
+6 −0 include/tvm/runtime/ndarray.h
+3 −0 include/tvm/tir/schedule/schedule.h
+47 −0 include/tvm/tir/stmt.h
+4 −0 include/tvm/tir/stmt_functor.h
+2 −0 include/tvm/tir/transform.h
+22 −3 python/tvm/tir/schedule/schedule.py
+9 −0 python/tvm/tir/transform/transform.py
+1 −0 src/driver/driver_api.cc
+1 −0 src/relay/printer/text_printer.h
+6 −0 src/relay/printer/tir_text_printer.cc
+7 −0 src/relay/printer/tvmscript_printer.cc
+50 −0 src/runtime/ndarray.cc
+7 −0 src/script/printer/legacy_repr.cc
+1 −0 src/script/printer/tir/stmt.cc
+32 −4 src/target/source/codegen_c.cc
+1 −0 src/target/source/codegen_c.h
+3 −3 src/target/source/codegen_cuda.cc
+5 −0 src/tir/analysis/device_constraint_utils.cc
+12 −0 src/tir/ir/stmt.cc
+8 −0 src/tir/ir/stmt_functor.cc
+4 −0 src/tir/ir/tir_visitor_with_path.cc
+1 −0 src/tir/ir/tir_visitor_with_path.h
+6 −0 src/tir/schedule/concrete_schedule.cc
+1 −0 src/tir/schedule/concrete_schedule.h
+11 −0 src/tir/schedule/primitive.h
+165 −0 src/tir/schedule/primitive/rewrite_buffer_access.cc
+6 −0 src/tir/schedule/schedule.cc
+11 −1 src/tir/schedule/traced_schedule.cc
+1 −0 src/tir/schedule/traced_schedule.h
+123 −0 src/tir/transforms/inject_customized_code.cc
+6 −2 src/tir/transforms/inject_permuted_layout.cc
+6 −0 src/tir/transforms/inject_software_pipeline.cc
+13 −4 src/tir/transforms/lower_device_kernel_launch.cc
+189 −0 tests/python/tir-transform/test_tir_transform_inject_customized_code.py
2 changes: 2 additions & 0 deletions CONTRIBUTING.md
@@ -27,6 +27,8 @@ Please ask questions in issues.

All pull requests are super welcomed and greatly appreciated! Issues in need of a solution are marked with a [`♥ help`](https://github.com/ianstormtaylor/BitBLAS/issues?q=is%3Aissue+is%3Aopen+label%3A%22%E2%99%A5+help%22) label if you're looking for somewhere to start.

Please run `./format.sh` before submitting a pull request to make sure that your code is formatted correctly.

Please include tests and docs with every pull request!

## Repository Setup
4 changes: 4 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,4 @@
recursive-include 3rdparty/tvm *
recursive-exclude 3rdparty/tvm/build *
recursive-exclude 3rdparty/clang* *
recursive-exclude 3rdparty/llvm* *
93 changes: 66 additions & 27 deletions README.md
@@ -1,38 +1,83 @@
# BitBLAS

BitBLAS is a lightweight framework designed to generate high-performance CUDA/HIP code for BLAS operators, featuring swizzling and layout propagation. It achieves performance comparable to vendor libraries across various platforms and hardware. BitBLAS aims to assist algorithm developers working on projects like BitNet, GPTQ, and similar endeavors by enabling the rapid implementation of accelerated kernels and their efficient deployment.
BitBLAS is a library to support mixed-precision BLAS operations on GPUs, for example, the $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication where $C_{cdtype}[M, N] = A_{adtype}[M, K] \times W_{wdtype}[N, K]$.
BitBLAS aims to support efficient mixed-precision DNN model deployment, especially the $W_{wdtype}A_{adtype}$ quantization in large language models (LLMs), for example, the $W_{INT4}A_{FP16}$ in [GPTQ](https://arxiv.org/abs/2210.17323), the $W_{INT2}A_{FP16}$ in [BitDistiller](https://arxiv.org/abs/2402.10631), the $W_{INT1}A_{INT8}$ and $W_{INT2}A_{INT8}$ in [BitNet](https://arxiv.org/abs/2310.11453) and [BitNet-b1.58](https://arxiv.org/abs/2402.17764). BitBLAS is based on techniques from our accepted submission at OSDI'24.
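
For intuition, here is a plain PyTorch sketch of the computation above for the $W_{INT4}A_{FP16}$ case (reference semantics only, not BitBLAS code; the tensor names and the per-output-channel scale layout are illustrative assumptions):

```python
import torch

# Hypothetical shapes and names, for illustration only.
M, N, K = 16, 4096, 4096
A = torch.randn(M, K, dtype=torch.float16)                # A_FP16[M, K]
W_int4 = torch.randint(-8, 8, (N, K), dtype=torch.int8)   # W_INT4[N, K] values, stored in int8
scale = torch.rand(N, 1, dtype=torch.float16)             # assumed per-output-channel scale

# Reference semantics: dequantize W, then C[M, N] = A[M, K] x W[N, K]^T.
W_fp16 = W_int4.to(torch.float16) * scale
C = (A.float() @ W_fp16.float().t()).to(torch.float16)    # float32 matmul keeps the CPU reference simple
```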


Some of the key features of BitBLAS include:
- Auto Tensorize compute with TensorCore-like hardware instructions.
- High Performance (Not only FP16xFP16, INT8xINT8, but also FP16xINT4/2/1, INT8xINT4/2/1).
- With the flexible DSL (TIR Script) to effortlessly craft domain-specific kernels for your situations.
- Support with dynamic symbolic through tvm unity -> generate source code with dynamic shape.
- BitBLAS first proposed int8xint1 gemv/gemm with 10x/2x speedup over float16xfloat16 on A100, please checkout [op_benchmark_a100_int1_scaling](images/figures/op_benchmark_a100_int1_scaling.png) for detailed input scaling benchmark results.
- High performance matrix multiplication for both GEMV (e.g., the single batch auto-regressive decode phase in LLM) and GEMM (e.g., the batched auto-regressive decode phase and the prefill phase in LLM):
- $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication including FP16xINT4/2/1, INT8xINT4/2/1, etc. Please checkout [support matrix](#support-matrix) for detailed data types support.
- Matrix multiplication like FP16xFP16 and INT8xINT8.
- Auto-Tensorization for TensorCore-like hardware instructions.
- Implemented [integration](./integration/) to [PyTorch](https://pytorch.org/), [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) and [vLLM](https://github.com/vllm-project/vllm) for LLM deployment. Please checkout [benchmark summary](#benchmark-summary) for detailed end2end LLM inference performance.
- BitBLAS first implemented $W_{INT1}A_{INT8}$ GEMV/GEMM with 10x/2x speedup over $W_{FP16}A_{FP16}$ on A100, please checkout [op_benchmark_a100_int1_scaling](images/figures/op_benchmark_a100_int1_scaling.png) for detailed benchmark results.
- Support customizing mixed-precision DNN operations for your specific scenarios via the flexible DSL (TIR Script).

## Integration Example of FasterTransformer with BitBLAS
![FasterTransformer Integration](images/gif/FasterTransformer.gif)


## Benchmark Summary

BitBLAS achieves exceptional performance across a variety of computational patterns. Below are selected results showcasing its capabilities:

- End2End Integration with Quantize Inference Kernel for AutoGPTQ and vLLM.

<div>
<img src="./images/figures/end2end_llama_13b_auto_gptq.png" alt="AutoGPTQ end2end performance of llama13b on A100" style="width: 24%;" />
<img src="./images/figures/end2end_llama_70b_auto_gptq.png" alt="AutoGPTQ end2end performance of llama13b on A100" style="width: 24%;" />
<img src="./images/figures/end2end_llama_13b_vllm.png" alt="vLLM end2end performance of llama13b on A100" style="width: 24%;" />
<img src="./images/figures/end2end_llama_70B_vllm.png" alt="vLLM end2end performance of llama13b on A100" style="width: 24%;" />
</div>

- Weight Only Matmul performance on A100

<div>
<img src="./images/figures/op_benchmark_a100_wq_gemv_e7.png" alt="gemm weight only performance on A100" style="width: 49%;" />
<img src="./images/figures/op_benchmark_a100_wq_gemm_e7.png" alt="gemm weight only performance on A100" style="width: 49%;" />
</div>

## Benchmark
BitBLAS can achieve optimal performance across various compute patterns:

- GTX 3090
- FLOAT16xFLOAT16 with TensorCore ![3090-gemm-fp16](./images/figures/op_benchmark_3090_fp16_gemm.png)
- INT8xINT8 with TensorCore ![3090-gemm-s8](./images/figures/op_benchmark_3090_s8_gemm.png)
- FLOAT16xAF4(LUT4) GEMV ![3090-af4-gemv](./images/figures/op_benchmark_3090_af4_gemv.png)
- FLOAT16xAF4(LUT4) with TensorCore ![3090-af4-gemm](./images/figures/op_benchmark_3090_af4_gemm.png)
- TensorCore FP16/INT8 GEMM Performance Vs. Vendor Library on A100 and RTX4090

- A100
- WeightOnly GEMV ![a100-wq-gemv](./images/figures/op_benchmark_a100_wq_gemv.png)
- WeightOnly GEMM with TensorCore ![a100-wq-gemm](./images/figures/op_benchmark_a100_wq_gemm.png)
<div>
<img src="./images/figures/op_benchmark_consistent_gemm_fp16.png" alt="gemm fp16 performance on 4090 and a100" style="width: 49%;" />
<img src="./images/figures/op_benchmark_consistent_gemm_int8.png" alt="gemm int8 performance on 4090 and a100" style="width: 49%;" />
</div>

See more details in our [benchmark](./benchmark) directory.
For more detailed information on benchmark sets with other formats (NF4/FP4) and other devices (GTX 3090), please refer to the [benchmark](./benchmark/README.md).

## Support Matrix

| **A_dtype** | **W_dtype** | **Accum_dtype** | **Out_dtype** | **BitBLAS<br>Support** | **Tested<br>Platform** |
|:-----------:|:-----------:|:---------------:|:---------------:|:----------------------:|:----------------------:|
| FP16 | FP16 | FP16 | FP16 | **√** | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | FP4_E2M1 | FP16 | FP16 | **√** | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | INT8 | FP16 | FP16 | **√** | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | INT4 | FP16 | FP16 | **√** | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | INT2 | FP16 | FP16 | **√** | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | INT1 | FP16 | FP16 | **√** | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | NF4 | FP16 | FP16 | **√** | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| INT8 | INT8 | INT32 | FP32/INT32/FP16/INT8 | **√** | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| INT8 | INT4 | INT32 | FP32/INT32/FP16/INT8 | **√** | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| INT8 | INT2 | INT32 | FP32/INT32/FP16/INT8 | **√** | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| INT8 | INT1 | INT32 | FP32/INT32/FP16/INT8 | **√** | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |

We are continuously expanding the support matrix. If you have any specific requirements, please feel free to open an issue or PR.

## Getting Started

- Installation:
To manually install BitBLAS, please checkout `maint/scripts/installation.sh`. Also make sure you already have the CUDA toolkit (version >= 11) installed in the system. Or you can install from `python setup.py install` or `pip install .` in the root directory.
- [Installation](./docs/Installation.md):
To install BitBLAS, please checkout the document [installation](./docs/Installation.md). Also make sure you already have the CUDA toolkit (version >= 11) installed in the system. Or you can easily install it with `pip install bitblas`.

- [QuickStart](./docs/QuickStart.md): BitBLAS provides two Python APIs to perform mixed-precision matrix multiplication (a minimal usage sketch follows this list):
- ```bitblas.Matmul``` implements the $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication of $C_{cdtype}[M, N] = A_{adtype}[M, K] \times W_{wdtype}[N, K]$.
- ```bitblas.Linear``` is a PyTorch ```nn.Linear```-like module that supports mixed-precision linear layers.

- [QuickStart](./docs/QuickStart.md): We provide two primary ways to do the code generation: using a high-level DSL (TensorIR Script) or using packed operators. The quick start guide shows how to use BitBLAS to generate high-performance kernels with both methods.
- [Integration](./integration/): Explore how BitBLAS seamlessly integrates with LLM deployment frameworks through our examples. Discover the ease of integrating BitBLAS with PyTorch, AutoGPTQ, and vLLM in the 3rd-party integration examples.

- [Customization](./docs/ExtendOperatorsWithDSL.md): BitBLAS supports implementing customized mixed-precision DNN operations beyond matrix multiplication with the flexible DSL (TIR Script).

- [3rd Party Integration](./integration/): BitBLAS can also be easily integrated to other frameworks, the integration provides some examples of integrating BitBLAS with PyTorch, AutoGPTQ and vLLM.
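
Below is a condensed sketch of the ```bitblas.Matmul``` API referenced in the QuickStart item above, loosely following the QuickStart document (the configuration values are placeholders, argument names may differ slightly across versions, and a CUDA device is assumed):

```python
import torch
import bitblas

# Illustrative W_INT4 x A_FP16 configuration; values are placeholders.
config = bitblas.MatmulConfig(
    M=1,                      # e.g. single-batch decode (GEMV-like)
    N=1024,
    K=1024,
    A_dtype="float16",
    W_dtype="int4",
    accum_dtype="float16",
    out_dtype="float16",
)
matmul = bitblas.Matmul(config=config)

# Quantized weight in [N, K] layout; BitBLAS repacks it into its internal layout.
weight = torch.randint(0, 7, (1024, 1024), dtype=torch.int8).cuda()
packed_weight = matmul.transform_weight(weight)

activation = torch.rand(1, 1024, dtype=torch.float16).cuda()
output = matmul(activation, packed_weight)   # FP16 output of shape [1, 1024]
```

```bitblas.Linear``` wraps the same kind of configuration in an ```nn.Linear```-style module, so it can stand in for linear layers of an existing PyTorch model.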

## Contributing

@@ -46,9 +91,3 @@ This project has adopted the Microsoft Open Source Code of Conduct. For more inf

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

## Acknowledgement

We learned a lot from the following projects.

- [Apache TVM](https://github.com/apache/tvm): BitBLAS has adopted TensorIR as our DSL. Additionally, we have customized TVM from the unity branch to incorporate specific features that were required for our project.
- [Microsoft Roller](https://github.com/microsoft/nnfusion/tree/roller): The design and algorithm inspiration of hardware-aware tuning in BitBLAS comes from Roller.
36 changes: 20 additions & 16 deletions SUPPORT.md
@@ -1,25 +1,29 @@
# TODO: The maintainer of this repo has not yet edited this file
# Support

**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?
Welcome to the BitBLAS support page! BitBLAS is a cutting-edge framework designed for generating high-performance CUDA/HIP code for BLAS operators. Whether you're working on projects like BitNet, GPTQ, or similar, BitBLAS is here to accelerate your development with its robust features.

- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.
## How to File Issues and Get Help

*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*
### Reporting Bugs or Requesting Features

# Support
If you encounter a bug or have a feature request, we encourage you to file an issue through our GitHub Issues page. Please follow these steps:

1. **Search Existing Issues**: Before creating a new issue, please search the existing ones to avoid duplicates.
2. **Create a New Issue**: If your issue is new, go ahead and file it as a new issue. Provide as much detail as possible to help us understand and address it efficiently.

### Seeking Help and Questions

For questions and help with using BitBLAS, we offer the following channels:

- **GitHub Discussions**: For community support, sharing ideas, and discussing best practices, please visit our [GitHub Discussions](https://github.com/YOUR_REPO/discussions).
- **Stack Overflow**: Use the tag `BitBLAS` when posting questions. This is monitored by our team and the community.

## How to file issues and get help
## Microsoft Support Policy

This project uses GitHub Issues to track bugs and feature requests. Please search the existing
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.
Support for BitBLAS is primarily provided through the above-mentioned community channels. We strive to address issues and questions in a timely manner, leveraging the collective knowledge and experience of the BitBLAS community.

For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
CHANNEL. WHERE WILL YOU HELP PEOPLE?**.
## Contributing to BitBLAS

## Microsoft Support Policy
We warmly welcome contributions to the BitBLAS project. Whether it's improving the documentation, adding new features, or fixing bugs, your contributions are invaluable to us. Please refer to our [CONTRIBUTING.md](./CONTRIBUTING.md) file for more details on how to contribute.

Support for this **PROJECT or PRODUCT** is limited to the resources listed above.
Before submitting a pull request, you may need to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. The CLA process is straightforward and only needs to be completed once.