🐛 Bug
I tried to use Qwen1.5-0.5B-Chat as the draft model for Qwen1.5-7B-Chat, but the engine produced no response with speculative_mode="small_draft". I also tried EAGLE-Qwen2-7B-Instruct with speculative_mode="eagle", and it likewise produced no output. Switching to llama-2-7b-chat-hf did not help either. When I inspected the process, it appeared to be stuck in "tvm_ffi_ctypes\packed_func.py". After I added "max_num_sequence=spec_draft_length+2" to "engine_config", the hang turned into an error: "TVMError: Check failed: draft_token_indices->size() == num_sequence (2 vs. 1) :".
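For clarity, the engine_config change mentioned above looks roughly like this; a minimal sketch, assuming spec_draft_length=5 and the EAGLE draft-model paths from the reproduction script below:

```python
from mlc_llm.serve.sync_engine import EngineConfig

# Sketch only: values mirror the reproduction script below.
small_model = "dist/Eagle-Llama-2-7b-chat-q4f16_1-MLC"
small_model_lib = "dist/libs/Eagle-Llama-2-7b-chat-q4f16_1-vulkan.dll"
spec_draft_length = 5

engine_config = EngineConfig(
    additional_models=[(small_model, small_model_lib)],
    spec_draft_length=spec_draft_length,
    # Without this line the engine silently hangs; with it, the step() call
    # fails with the draft_token_indices TVMError shown in the log below.
    max_num_sequence=spec_draft_length + 2,  # = 7
    speculative_mode="eagle",
)
```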
To Reproduce
Steps to reproduce the behavior:
1. Download the Qwen or Llama models.
2. Quantize the weights, run gen_config, and compile the model libraries.
3. Run the following sample to reproduce:
```python
from mlc_llm.serve.sync_engine import EngineConfig, SyncMLCEngine
from mlc_llm.protocol.generation_config import GenerationConfig

prompts = ["what is the meaning of life?"]

# Create engine
model = "dist/Llama-2-7b-chat-q4f16_1-MLC"
model_lib = "dist/libs/Llama-2-7b-chat-q4f16_1-vulkan.dll"
small_model = "dist/Eagle-Llama-2-7b-chat-q4f16_1-MLC"  # "dist/Qwen1.5-0.5B-Chat-q4f16_1-MLC"
small_model_lib = (
    "dist/libs/Eagle-Llama-2-7b-chat-q4f16_1-vulkan.dll"  # "dist/libs/Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll"
)
engine = SyncMLCEngine(
    model=model,
    model_lib=model_lib,
    mode="local",
    engine_config=EngineConfig(
        additional_models=[(small_model, small_model_lib)],
        spec_draft_length=5,
        max_num_sequence=7,
        speculative_mode="eagle",
    ),
)

num_requests = 1

# Generate output.
output_texts, _ = engine.generate(
    prompts[:num_requests],
    GenerationConfig(temperature=0.0, top_p=0, seed=42, max_tokens=256, stop_token_ids=[2], n=1),
)
for req_id, outputs in enumerate(output_texts):
    print(f"Prompt {req_id}: {prompts[req_id]}")
    if len(outputs) == 1:
        print(f"Output {req_id}:{outputs[0]}\n")
    else:
        for i, output in enumerate(outputs):
            print(f"Output {req_id}({i}):{output}\n")
```
The error message is:
(mlc-chat-env) C:\Users\Administrator\Desktop>python mlc_spec.py
[2024-11-04 03:04:39] INFO auto_device.py:88: Not found device: cuda:0
[2024-11-04 03:04:40] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-04 03:04:41] INFO auto_device.py:88: Not found device: metal:0
[2024-11-04 03:04:43] INFO auto_device.py:79: Found device: vulkan:0
[2024-11-04 03:04:44] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-04 03:04:44] INFO auto_device.py:35: Using device: vulkan:0
[2024-11-04 03:04:44] INFO engine_base.py:143: Using library model: dist/libs/Llama-2-7b-chat-q4f16_1-vulkan.dll
[2024-11-04 03:04:44] INFO engine_base.py:143: Using library model: dist/libs/Eagle-Llama-2-7b-chat-q4f16_1-vulkan.dll
[2024-11-04 03:04:44] INFO engine_base.py:180: The selected engine mode is local. We choose small max batch size and KV cache capacity to use less GPU memory.
[2024-11-04 03:04:44] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[2024-11-04 03:04:44] INFO engine_base.py:210: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[03:04:44] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "local", max batch size 7 is specified by user, max KV cache token capacity will be set to 768, prefill chunk size will be set to 768.
[03:04:44] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "interactive", max batch size 7 is specified by user, max KV cache token capacity will be set to 768, prefill chunk size will be set to 768.
[03:04:44] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "server", max batch size 7 is specified by user, max KV cache token capacity will be set to 4989, prefill chunk size will be set to 768.
[03:04:44] D:\a\package\package\mlc-llm\cpp\serve\config.cc:769: The actual engine mode is "local". So max batch size is 7, max KV cache token capacity is 768, prefill chunk size is 768.
[03:04:44] D:\a\package\package\mlc-llm\cpp\serve\config.cc:774: Estimated total single GPU memory usage: 4568.668 MB (Parameters: 3812.023 MB. KVCache: 540.390 MB. Temporary buffer: 216.255 MB). The actual usage might be slightly larger than the estimated number.
[03:04:44] D:\a\package\package\mlc-llm\cpp\serve\engine.cc:365: Warning: Hybrid prefill mode fallbacks to chunked prefill, due to speculative mode is enabled and not implemented with hybrid prefill yet.
Traceback (most recent call last):
File "C:\Users\Administrator\Desktop\mlc_spec.py", line 27, in
output_texts, _ = engine.generate(
^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\mlc_llm\serve\sync_engine.py", line 283, in generate
self.step()
File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\mlc_llm\serve\sync_engine.py", line 351, in step
self._ffi"step"
File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\tvm_ffi_ctypes\packed_func.py", line 245, in call
raise_last_ffi_error()
File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\tvm_ffi\base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
File "D:\a\package\package\mlc-llm\cpp\serve\logit_processor.cc", line 126
TVMError: Check failed: draft_token_indices->size() == num_sequence (2 vs. 1) :
Expected behavior
The model streams output for the provided prompt.
Environment
How you installed MLC-LLM (conda, source): conda with pip
How you installed TVM-Unity (pip, source): pip
Python version (e.g. 3.10): 3.11
GPU driver version (if applicable): -
CUDA/cuDNN version (if applicable): -
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
USE_NVTX: OFF
USE_GTEST: AUTO
SUMMARIZE: OFF
TVM_DEBUG_WITH_ABI_CHANGE: OFF
USE_IOS_RPC: OFF
USE_MSC: OFF
USE_ETHOSU:
CUDA_VERSION: NOT-FOUND
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_THRUST: OFF
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM:
USE_OPENCL_GTEST: /path/to/opencl/gtest
TVM_LOG_BEFORE_THROW: OFF
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
MLIR_VERSION: NOT-FOUND
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_MSCCL: OFF
USE_VITIS_AI: OFF
USE_MLIR: OFF
USE_RCCL: OFF
USE_LLVM: llvm-config --link-static
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_NCCL: OFF
USE_ROCBLAS: OFF
GIT_COMMIT_HASH: 2685d6ace64c30a077c1b3f6893d2e38589be7bb
USE_VULKAN: ON
USE_RUST_EXT: OFF
USE_CUTLASS: OFF
USE_CPP_RPC: OFF
USE_HEXAGON: OFF
USE_CUSTOM_LOGGING: OFF
USE_UMA: OFF
USE_FALLBACK_STL_MAP: OFF
USE_SORT: ON
USE_RTTI: ON
GIT_COMMIT_TIME: 2024-09-07 15:18:06 -0400
USE_HIPBLAS: OFF
USE_HEXAGON_SDK: /path/to/sdk
USE_BLAS: none
USE_ETHOSN: OFF
USE_LIBTORCH: OFF
USE_RANDOM: ON
USE_CUDA: OFF
USE_COREML: OFF
USE_AMX: OFF
BUILD_STATIC_RUNTIME: OFF
USE_CMSISNN: OFF
USE_KHRONOS_SPIRV: OFF
USE_CLML_GRAPH_EXECUTOR: OFF
USE_TFLITE: OFF
USE_HEXAGON_GTEST: /path/to/hexagon/gtest
PICOJSON_PATH: 3rdparty/picojson
USE_OPENCL_ENABLE_HOST_PTR: OFF
INSTALL_DEV: OFF
USE_PROFILER: ON
USE_NNPACK: OFF
LLVM_VERSION: 18.1.8
USE_MRVL: OFF
USE_OPENCL: OFF
COMPILER_RT_PATH: 3rdparty/compiler-rt
RANG_PATH: 3rdparty/rang/include
USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
USE_OPENMP: OFF
USE_BNNS: OFF
USE_FLASHINFER:
USE_CUBLAS: OFF
USE_METAL: OFF
USE_MICRO_STANDALONE_RUNTIME: OFF
USE_HEXAGON_EXTERNAL_LIBS: OFF
USE_ALTERNATIVE_LINKER: AUTO
USE_BYODT_POSIT: OFF
USE_NVSHMEM: OFF
USE_HEXAGON_RPC: OFF
USE_MICRO: OFF
DMLC_PATH: 3rdparty/dmlc-core/include
INDEX_DEFAULT_I64: ON
USE_RELAY_DEBUG: OFF
USE_RPC: ON
USE_TENSORFLOW_PATH: none
TVM_CLML_VERSION:
USE_MIOPEN: OFF
USE_ROCM: OFF
USE_PAPI: OFF
USE_CURAND: OFF
TVM_CXX_COMPILER_PATH: C:/Program Files/Microsoft Visual Studio/2022/Enterprise/VC/Tools/MSVC/14.41.34120/bin/Hostx64/x64/cl.exe
HIDE_PRIVATE_SYMBOLS: OFF
Any other relevant information: All models were downloaded locally and work fine without speculative decoding.
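To illustrate the working baseline, here is a minimal sketch without speculative decoding, assuming engine_config can simply be omitted and reusing the paths and generation settings from the reproduction script above:

```python
from mlc_llm.serve.sync_engine import SyncMLCEngine
from mlc_llm.protocol.generation_config import GenerationConfig

# Baseline sketch: same main model, no additional_models / speculative_mode.
engine = SyncMLCEngine(
    model="dist/Llama-2-7b-chat-q4f16_1-MLC",
    model_lib="dist/libs/Llama-2-7b-chat-q4f16_1-vulkan.dll",
    mode="local",
)
output_texts, _ = engine.generate(
    ["what is the meaning of life?"],
    GenerationConfig(temperature=0.0, top_p=0, seed=42, max_tokens=256, stop_token_ids=[2], n=1),
)
print(output_texts[0][0])  # prints normally in this configuration
```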
Additional context
When I switch to speculative_mode="small_draft", the error changes to the one shown in the log below.
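For reference, a sketch of the small_draft setup, with the speculative settings mirroring the eagle script above and paths taken from the traceback below (this particular run used Qwen1.5-0.5B-Chat as both the target and the draft model):

```python
from mlc_llm.serve.sync_engine import EngineConfig, SyncMLCEngine

# Sketch only: spec_draft_length/max_num_sequence mirror the eagle script above.
engine = SyncMLCEngine(
    model="dist/Qwen1.5-0.5B-Chat-q4f16_1-MLC",
    model_lib="dist/libs/Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll",
    mode="server",  # the log below shows engine mode "server" for this run
    engine_config=EngineConfig(
        additional_models=[(
            "dist/Qwen1.5-0.5B-Chat-q4f16_1-MLC",
            "dist/libs/Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll",
        )],
        spec_draft_length=5,
        max_num_sequence=7,
        speculative_mode="small_draft",  # instead of "eagle"
    ),
)
```

The run then fails with: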
(mlc-chat-env) C:\Users\Administrator\Desktop>python mlc_qs.py
[2024-11-04 03:06:21] INFO auto_device.py:88: Not found device: cuda:0
[2024-11-04 03:06:23] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-04 03:06:24] INFO auto_device.py:88: Not found device: metal:0
[2024-11-04 03:06:25] INFO auto_device.py:79: Found device: vulkan:0
[2024-11-04 03:06:27] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-04 03:06:27] INFO auto_device.py:35: Using device: vulkan:0
[2024-11-04 03:06:27] INFO engine_base.py:143: Using library model: dist/libs/Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll
[2024-11-04 03:06:27] INFO engine_base.py:143: Using library model: dist\libs\Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll
[2024-11-04 03:06:27] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-11-04 03:06:27] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-11-04 03:06:27] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "local", max batch size 7 is specified by user, max KV cache token capacity will be set to 1024, prefill chunk size will be set to 1024.
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "interactive", max batch size 7 is specified by user, max KV cache token capacity will be set to 1024, prefill chunk size will be set to 1024.
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "server", max batch size 7 is specified by user, max KV cache token capacity will be set to 7168, prefill chunk size will be set to 1024.
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\config.cc:769: The actual engine mode is "server". So max batch size is 7, max KV cache token capacity is 7168, prefill chunk size is 1024.
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\config.cc:774: Estimated total single GPU memory usage: 3241.175 MB (Parameters: 498.145 MB. KVCache: 1456.284 MB. Temporary buffer: 1286.746 MB). The actual usage might be slightly larger than the estimated number.
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\engine.cc:365: Warning: Hybrid prefill mode fallbacks to chunked prefill, due to speculative mode is enabled and not implemented with hybrid prefill yet.
Traceback (most recent call last):
File "C:\Users\Administrator\Desktop\mlc_qs.py", line 667, in
test_engine_basic("dist/Qwen1.5-0.5B-Chat-q4f16_1-MLC","dist/libs/Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll",['dist\Qwen1.5-0.5B-Chat-q4f16_1-MLC','dist\libs\Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll'])
File "C:\Users\Administrator\Desktop\mlc_qs.py", line 121, in test_engine_basic
engine.step()
File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\mlc_llm\serve\sync_engine.py", line 351, in step
self._ffi"step"
File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\tvm_ffi_ctypes\packed_func.py", line 245, in call
raise_last_ffi_error()
File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\tvm_ffi\base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
File "D:\a\package\package\mlc-llm\cpp\serve\engine_actions\batch_draft.cc", line 151
InternalError: Check failed: (!mstates[i]->draft_output_tokens.empty()) is false: