-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TensorIntrin 'mma_i8i8f16_smooth_a_trans_b_smooth_b' is not registered #160
Comments
However, the code works when M=1 |
hi @huanpengchu , we recommend accum dtype to use int32 :) |
and how to use FP8 matmul kernel (4090)?
but output |
hi @huanpengchu , thanks for reporting this! Following the changes in PR #133, which set the default memory scope from shared to shared.dyn, the FP8 Tensorcore has not been fully tested. As a result, we missed the registeration for FP8 ldmatrix operations on dynamic shared memory, leading to this issue. I just made a fix, checkout pr #162 |
you can install the latest update by |
I want to use INT8 matmul , and the code/output is as follows:
Code
And the error:
2024-08-29 10:43:56 [BitBLAS:INFO]: Auto detected target: nvidia/nvidia-a100 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [4, 1, 16], 'thread': [4, 1, 16], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [2, 1, 32], 'thread': [2, 1, 32], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [4, 1, 32], 'thread': [4, 1, 32], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [2, 1, 16], 'thread': [2, 1, 16], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [2, 2, 16], 'thread': [2, 2, 16], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [1, 1, 16], 'thread': [1, 1, 16], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [8, 1, 8], 'thread': [8, 1, 8], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [8, 1, 16], 'thread': [8, 1, 16], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [4, 2, 16], 'thread': [4, 2, 16], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [4, 2, 8], 'thread': [4, 2, 8], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [4, 1, 8], 'thread': [4, 1, 8], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [1, 2, 32], 'thread': [1, 2, 32], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [2, 2, 32], 'thread': [2, 2, 32], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [1, 1, 32], 'thread': [1, 1, 32], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [2, 1, 8], 'thread': [2, 1, 8], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [4, 2, 32], 'thread': [4, 2, 16], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [8, 1, 32], 'thread': [8, 1, 16], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [4, 1, 4], 'thread': [4, 1, 4], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [1, 1, 8], 'thread': [1, 1, 8], 'rstep': []} 2024-08-29 10:44:20 [BitBLAS:DEBUG]: Apply config {'block': [1, 2, 16], 'thread': [1, 2, 16], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [4, 1, 16], 'thread': [4, 1, 16], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.006 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [2, 1, 32], 'thread': [2, 1, 32], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [4, 1, 32], 'thread': [4, 1, 32], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [2, 1, 16], 'thread': [2, 1, 16], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [2, 2, 16], 'thread': [2, 2, 16], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.006 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [1, 1, 16], 'thread': [1, 1, 16], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [8, 1, 8], 'thread': [8, 1, 8], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [8, 1, 16], 'thread': [8, 1, 16], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [4, 2, 16], 'thread': [4, 2, 16], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [4, 2, 8], 'thread': [4, 2, 8], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [4, 1, 8], 'thread': [4, 1, 8], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [1, 2, 32], 'thread': [1, 2, 32], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [2, 2, 32], 'thread': [2, 2, 32], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [1, 1, 32], 'thread': [1, 1, 32], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.006 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [2, 1, 8], 'thread': [2, 1, 8], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [4, 2, 32], 'thread': [4, 2, 16], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [8, 1, 32], 'thread': [8, 1, 16], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [4, 1, 4], 'thread': [4, 1, 4], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.007 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [1, 1, 8], 'thread': [1, 1, 8], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.006 ms 2024-08-29 10:44:25 [BitBLAS:INFO]: Evaluation with config {'block': [1, 2, 16], 'thread': [1, 2, 16], 'rstep': []} 2024-08-29 10:44:25 [BitBLAS:INFO]: Time cost of this config: 0.005 ms 2024-08-29 10:44:28 [BitBLAS:INFO]: Tile Dict: [16, 256] Shared memory exceeds the static capacity, use dynamic shared memory. 2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply config {'block': [16, 16], 'warp': [16, 16], 'rstep': [512], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} 2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply config {'block': [16, 32], 'warp': [16, 16], 'rstep': [512], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} 2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply config {'block': [16, 64], 'warp': [16, 16], 'rstep': [256], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} 2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply config {'block': [16, 128], 'warp': [16, 32], 'rstep': [128], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} 2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply schedule failed: Traceback (most recent call last): 3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)>::AssignTypedLambda<tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}>(tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) 2: tvm::tir::TracedScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool) 1: tvm::tir::ConcreteScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool) 0: tvm::tir::TensorIntrin::Get(tvm::runtime::String, bool) File "/root/BitBLAS/3rdparty/tvm/src/tir/ir/function.cc", line 151 ValueError: TensorIntrin 'mma_i8i8f16_smooth_a_trans_b_smooth_b' is not registered 2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply config {'block': [16, 256], 'warp': [16, 64], 'rstep': [128], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} 2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply schedule failed: Traceback (most recent call last): 3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)>::AssignTypedLambda<tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}>(tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) 2: tvm::tir::TracedScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool) 1: tvm::tir::ConcreteScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool) 0: tvm::tir::TensorIntrin::Get(tvm::runtime::String, bool) File "/root/BitBLAS/3rdparty/tvm/src/tir/ir/function.cc", line 151 ValueError: TensorIntrin 'mma_i8i8f16_smooth_a_trans_b_smooth_b' is not registered 2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply schedule failed: Traceback (most recent call last): 3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)>::AssignTypedLambda<tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}>(tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) 2: tvm::tir::TracedScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool) 1: tvm::tir::ConcreteScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool) 0: tvm::tir::TensorIntrin::Get(tvm::runtime::String, bool) File "/root/BitBLAS/3rdparty/tvm/src/tir/ir/function.cc", line 151 ValueError: TensorIntrin 'mma_i8i8f16_smooth_a_trans_b_smooth_b' is not registered 2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply schedule failed: Traceback (most recent call last): 3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)>::AssignTypedLambda<tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}>(tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) 2: tvm::tir::TracedScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool) 1: tvm::tir::ConcreteScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool) 0: tvm::tir::TensorIntrin::Get(tvm::runtime::String, bool) File "/root/BitBLAS/3rdparty/tvm/src/tir/ir/function.cc", line 151 ValueError: TensorIntrin 'mma_i8i8f16_smooth_a_trans_b_smooth_b' is not registered 2024-08-29 10:44:28 [BitBLAS:DEBUG]: Apply schedule failed: Traceback (most recent call last): 3: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)>::AssignTypedLambda<tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}>(tvm::tir::{lambda(tvm::tir::Schedule, tvm::runtime::ObjectRef, tvm::runtime::String, bool)#14}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) 2: tvm::tir::TracedScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool) 1: tvm::tir::ConcreteScheduleNode::Tensorize(tvm::tir::LoopRV const&, tvm::runtime::String const&, bool) 0: tvm::tir::TensorIntrin::Get(tvm::runtime::String, bool) File "/root/BitBLAS/3rdparty/tvm/src/tir/ir/function.cc", line 151 ValueError: TensorIntrin 'mma_i8i8f16_smooth_a_trans_b_smooth_b' is not registered 2024-08-29 10:44:51 [BitBLAS:DEBUG]: Apply config {'block': [16, 16], 'warp': [16, 16], 'rstep': [512], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} failed, artifact path is None 2024-08-29 10:44:51 [BitBLAS:DEBUG]: Apply config {'block': [16, 32], 'warp': [16, 16], 'rstep': [512], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} failed, artifact path is None 2024-08-29 10:44:51 [BitBLAS:DEBUG]: Apply config {'block': [16, 64], 'warp': [16, 16], 'rstep': [256], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} failed, artifact path is None 2024-08-29 10:44:51 [BitBLAS:DEBUG]: Apply config {'block': [16, 128], 'warp': [16, 32], 'rstep': [128], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} failed, artifact path is None 2024-08-29 10:44:51 [BitBLAS:DEBUG]: Apply config {'block': [16, 256], 'warp': [16, 64], 'rstep': [128], 'use_tc': True, 'vectorize': {'A_reindex_reindex': 16, 'B_reindex_reindex': 16}, 'pipeline_stage': 2} failed, artifact path is None Traceback (most recent call last): File "/workspace/chuhuanpeng/quant_cogview/pytorch_dit_int8/pytorch_dit_single/temp_bitblas.py", line 74, in <module> matmul = bitblas.Matmul(config=matmul_config) File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/general_matmul/__init__.py", line 292, in __init__ self.dispatch_tir(target, from_database, source_format, enable_tuning) File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/general_matmul/__init__.py", line 348, in dispatch_tir self.hardware_aware_finetune() File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/operator.py", line 206, in hardware_aware_finetune self.optimized_func = self.apply_fast_tuning( File "/usr/local/lib/python3.10/dist-packages/bitblas/ops/operator.py", line 178, in apply_fast_tuning self.pass_context = best.config.pass_context AttributeError: 'NoneType' object has no attribute 'config'
The text was updated successfully, but these errors were encountered: