
[Bug] Speculative decoding doesn't work on Vulkan (AMD iGPU) #3011

Open
SkyHeroesS opened this issue Nov 4, 2024 · 0 comments
Labels: bug (Confirmed bugs)

🐛 Bug

I tried to use Qwen1.5-0.5B-Chat as a draft model for Qwen1.5-7B-Chat, but with speculative_mode="small_draft" the engine gave no response at all. I also tried EAGLE-Qwen2-7B-Instruct with speculative_mode="eagle", and it likewise produced no output. I then switched to llama-2-7b-chat-hf and tried again; there was still no output. Checking the process, it appears to be stuck in "tvm\_ffi\_ctypes\packed_func.py". After I added max_num_sequence=spec_draft_length+2 to engine_config, the hang turned into an error: "TVMError: Check failed: draft_token_indices->size() == num_sequence (2 vs. 1) :".
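
For reference, the small_draft attempt only differed from the script under "To Reproduce" below in the model paths and the speculative mode; a minimal sketch of that engine construction (the Qwen1.5-7B-Chat paths are assumptions that follow the same dist/ naming scheme as the other compiled artifacts):

```python
from mlc_llm.serve.sync_engine import EngineConfig, SyncMLCEngine

# Sketch of the "small_draft" attempt: Qwen1.5-0.5B-Chat drafts for Qwen1.5-7B-Chat.
# The 7B paths are assumed to follow the same dist/ naming scheme as the other models.
model = "dist/Qwen1.5-7B-Chat-q4f16_1-MLC"
model_lib = "dist/libs/Qwen1.5-7B-Chat-q4f16_1-vulkan.dll"
small_model = "dist/Qwen1.5-0.5B-Chat-q4f16_1-MLC"
small_model_lib = "dist/libs/Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll"

engine = SyncMLCEngine(
    model=model,
    model_lib=model_lib,
    mode="local",
    engine_config=EngineConfig(
        additional_models=[(small_model, small_model_lib)],
        spec_draft_length=5,
        speculative_mode="small_draft",  # draft-model speculation instead of EAGLE
    ),
)
```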

To Reproduce

Steps to reproduce the behavior:

1. Download the Qwen or Llama models.
2. Run quantization, gen_config, and compile.
3. Run the following sample to reproduce:

```python
from mlc_llm.serve.sync_engine import EngineConfig, SyncMLCEngine
from mlc_llm.protocol.generation_config import GenerationConfig

prompts = ["what is the meaning of life?"]

# Create engine
model = "dist/Llama-2-7b-chat-q4f16_1-MLC"
model_lib = "dist/libs/Llama-2-7b-chat-q4f16_1-vulkan.dll"
small_model = "dist/Eagle-Llama-2-7b-chat-q4f16_1-MLC"  # "dist/Qwen1.5-0.5B-Chat-q4f16_1-MLC"
small_model_lib = (
    "dist/libs/Eagle-Llama-2-7b-chat-q4f16_1-vulkan.dll"  # "dist/libs/Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll"
)
engine = SyncMLCEngine(
    model=model,
    model_lib=model_lib,
    mode="local",
    engine_config=EngineConfig(
        additional_models=[(small_model, small_model_lib)],
        spec_draft_length=5,
        max_num_sequence=7,
        speculative_mode="eagle",
    ),
)

num_requests = 1

# Generate output.
output_texts, _ = engine.generate(
    prompts[:num_requests],
    GenerationConfig(temperature=0.0, top_p=0, seed=42, max_tokens=256, stop_token_ids=[2], n=1),
)
for req_id, outputs in enumerate(output_texts):
    print(f"Prompt {req_id}: {prompts[req_id]}")
    if len(outputs) == 1:
        print(f"Output {req_id}:{outputs[0]}\n")
    else:
        for i, output in enumerate(outputs):
            print(f"Output {req_id}({i}):{output}\n")
```

The error message is:
    (mlc-chat-env) C:\Users\Administrator\Desktop>python mlc_spec.py
    [2024-11-04 03:04:39] INFO auto_device.py:88: Not found device: cuda:0
    [2024-11-04 03:04:40] INFO auto_device.py:88: Not found device: rocm:0
    [2024-11-04 03:04:41] INFO auto_device.py:88: Not found device: metal:0
    [2024-11-04 03:04:43] INFO auto_device.py:79: Found device: vulkan:0
    [2024-11-04 03:04:44] INFO auto_device.py:88: Not found device: opencl:0
    [2024-11-04 03:04:44] INFO auto_device.py:35: Using device: vulkan:0
    [2024-11-04 03:04:44] INFO engine_base.py:143: Using library model: dist/libs/Llama-2-7b-chat-q4f16_1-vulkan.dll
    [2024-11-04 03:04:44] INFO engine_base.py:143: Using library model: dist/libs/Eagle-Llama-2-7b-chat-q4f16_1-vulkan.dll
    [2024-11-04 03:04:44] INFO engine_base.py:180: The selected engine mode is local. We choose small max batch size and KV cache capacity to use less GPU memory.
    [2024-11-04 03:04:44] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
    [2024-11-04 03:04:44] INFO engine_base.py:210: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
    [03:04:44] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "local", max batch size 7 is specified by user, max KV cache token capacity will be set to 768, prefill chunk size will be set to 768.
    [03:04:44] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "interactive", max batch size 7 is specified by user, max KV cache token capacity will be set to 768, prefill chunk size will be set to 768.
    [03:04:44] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "server", max batch size 7 is specified by user, max KV cache token capacity will be set to 4989, prefill chunk size will be set to 768.
    [03:04:44] D:\a\package\package\mlc-llm\cpp\serve\config.cc:769: The actual engine mode is "local". So max batch size is 7, max KV cache token capacity is 768, prefill chunk size is 768.
    [03:04:44] D:\a\package\package\mlc-llm\cpp\serve\config.cc:774: Estimated total single GPU memory usage: 4568.668 MB (Parameters: 3812.023 MB. KVCache: 540.390 MB. Temporary buffer: 216.255 MB). The actual usage might be slightly larger than the estimated number.
    [03:04:44] D:\a\package\package\mlc-llm\cpp\serve\engine.cc:365: Warning: Hybrid prefill mode fallbacks to chunked prefill, due to speculative mode is enabled and not implemented with hybrid prefill yet.
    Traceback (most recent call last):
    File "C:\Users\Administrator\Desktop\mlc_spec.py", line 27, in
    output_texts, _ = engine.generate(
    ^^^^^^^^^^^^^^^^
    File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\mlc_llm\serve\sync_engine.py", line 283, in generate
    self.step()
    File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\mlc_llm\serve\sync_engine.py", line 351, in step
    self._ffi"step"
    File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\tvm_ffi_ctypes\packed_func.py", line 245, in call
    raise_last_ffi_error()
    File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\tvm_ffi\base.py", line 481, in raise_last_ffi_error
    raise py_err
    tvm._ffi.base.TVMError: Traceback (most recent call last):
    File "D:\a\package\package\mlc-llm\cpp\serve\logit_processor.cc", line 126
    TVMError: Check failed: draft_token_indices->size() == num_sequence (2 vs. 1) :

Expected behavior

The model streams output for the provided prompt.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): Vulkan
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): Windows
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): Radeon 880M
  • How you installed MLC-LLM (conda, source): conda with pip
  • How you installed TVM-Unity (pip, source): pip
  • Python version (e.g. 3.10): 3.11
  • GPU driver version (if applicable): -
  • CUDA/cuDNN version (if applicable): -
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): USE_NVTX: OFF
    USE_GTEST: AUTO
    SUMMARIZE: OFF
    TVM_DEBUG_WITH_ABI_CHANGE: OFF
    USE_IOS_RPC: OFF
    USE_MSC: OFF
    USE_ETHOSU:
    CUDA_VERSION: NOT-FOUND
    USE_LIBBACKTRACE: AUTO
    DLPACK_PATH: 3rdparty/dlpack/include
    USE_TENSORRT_CODEGEN: OFF
    USE_THRUST: OFF
    USE_TARGET_ONNX: OFF
    USE_AOT_EXECUTOR: ON
    BUILD_DUMMY_LIBTVM: OFF
    USE_CUDNN: OFF
    USE_TENSORRT_RUNTIME: OFF
    USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
    USE_CCACHE: AUTO
    USE_ARM_COMPUTE_LIB: OFF
    USE_CPP_RTVM:
    USE_OPENCL_GTEST: /path/to/opencl/gtest
    TVM_LOG_BEFORE_THROW: OFF
    USE_MKL: OFF
    USE_PT_TVMDSOOP: OFF
    MLIR_VERSION: NOT-FOUND
    USE_CLML: OFF
    USE_STACKVM_RUNTIME: OFF
    USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
    ROCM_PATH: /opt/rocm
    USE_DNNL: OFF
    USE_MSCCL: OFF
    USE_VITIS_AI: OFF
    USE_MLIR: OFF
    USE_RCCL: OFF
    USE_LLVM: llvm-config --link-static
    USE_VERILATOR: OFF
    USE_TF_TVMDSOOP: OFF
    USE_THREADS: ON
    USE_MSVC_MT: OFF
    BACKTRACE_ON_SEGFAULT: OFF
    USE_GRAPH_EXECUTOR: ON
    USE_NCCL: OFF
    USE_ROCBLAS: OFF
    GIT_COMMIT_HASH: 2685d6ace64c30a077c1b3f6893d2e38589be7bb
    USE_VULKAN: ON
    USE_RUST_EXT: OFF
    USE_CUTLASS: OFF
    USE_CPP_RPC: OFF
    USE_HEXAGON: OFF
    USE_CUSTOM_LOGGING: OFF
    USE_UMA: OFF
    USE_FALLBACK_STL_MAP: OFF
    USE_SORT: ON
    USE_RTTI: ON
    GIT_COMMIT_TIME: 2024-09-07 15:18:06 -0400
    USE_HIPBLAS: OFF
    USE_HEXAGON_SDK: /path/to/sdk
    USE_BLAS: none
    USE_ETHOSN: OFF
    USE_LIBTORCH: OFF
    USE_RANDOM: ON
    USE_CUDA: OFF
    USE_COREML: OFF
    USE_AMX: OFF
    BUILD_STATIC_RUNTIME: OFF
    USE_CMSISNN: OFF
    USE_KHRONOS_SPIRV: OFF
    USE_CLML_GRAPH_EXECUTOR: OFF
    USE_TFLITE: OFF
    USE_HEXAGON_GTEST: /path/to/hexagon/gtest
    PICOJSON_PATH: 3rdparty/picojson
    USE_OPENCL_ENABLE_HOST_PTR: OFF
    INSTALL_DEV: OFF
    USE_PROFILER: ON
    USE_NNPACK: OFF
    LLVM_VERSION: 18.1.8
    USE_MRVL: OFF
    USE_OPENCL: OFF
    COMPILER_RT_PATH: 3rdparty/compiler-rt
    RANG_PATH: 3rdparty/rang/include
    USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
    USE_OPENMP: OFF
    USE_BNNS: OFF
    USE_FLASHINFER:
    USE_CUBLAS: OFF
    USE_METAL: OFF
    USE_MICRO_STANDALONE_RUNTIME: OFF
    USE_HEXAGON_EXTERNAL_LIBS: OFF
    USE_ALTERNATIVE_LINKER: AUTO
    USE_BYODT_POSIT: OFF
    USE_NVSHMEM: OFF
    USE_HEXAGON_RPC: OFF
    USE_MICRO: OFF
    DMLC_PATH: 3rdparty/dmlc-core/include
    INDEX_DEFAULT_I64: ON
    USE_RELAY_DEBUG: OFF
    USE_RPC: ON
    USE_TENSORFLOW_PATH: none
    TVM_CLML_VERSION:
    USE_MIOPEN: OFF
    USE_ROCM: OFF
    USE_PAPI: OFF
    USE_CURAND: OFF
    TVM_CXX_COMPILER_PATH: C:/Program Files/Microsoft Visual Studio/2022/Enterprise/VC/Tools/MSVC/14.41.34120/bin/Hostx64/x64/cl.exe
    HIDE_PRIVATE_SYMBOLS: OFF
  • Any other relevant information: All models were downloaded locally and work fine without speculative decoding.

Additional context

When I switched to "small_draft", the error message changed to the following:
(mlc-chat-env) C:\Users\Administrator\Desktop>python mlc_qs.py
[2024-11-04 03:06:21] INFO auto_device.py:88: Not found device: cuda:0
[2024-11-04 03:06:23] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-04 03:06:24] INFO auto_device.py:88: Not found device: metal:0
[2024-11-04 03:06:25] INFO auto_device.py:79: Found device: vulkan:0
[2024-11-04 03:06:27] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-04 03:06:27] INFO auto_device.py:35: Using device: vulkan:0
[2024-11-04 03:06:27] INFO engine_base.py:143: Using library model: dist/libs/Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll
[2024-11-04 03:06:27] INFO engine_base.py:143: Using library model: dist\libs\Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll
[2024-11-04 03:06:27] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-11-04 03:06:27] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-11-04 03:06:27] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "local", max batch size 7 is specified by user, max KV cache token capacity will be set to 1024, prefill chunk size will be set to 1024.
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "interactive", max batch size 7 is specified by user, max KV cache token capacity will be set to 1024, prefill chunk size will be set to 1024.
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\config.cc:688: Under mode "server", max batch size 7 is specified by user, max KV cache token capacity will be set to 7168, prefill chunk size will be set to 1024.
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\config.cc:769: The actual engine mode is "server". So max batch size is 7, max KV cache token capacity is 7168, prefill chunk size is 1024.
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\config.cc:774: Estimated total single GPU memory usage: 3241.175 MB (Parameters: 498.145 MB. KVCache: 1456.284 MB. Temporary buffer: 1286.746 MB). The actual usage might be slightly larger than the estimated number.
[03:06:27] D:\a\package\package\mlc-llm\cpp\serve\engine.cc:365: Warning: Hybrid prefill mode fallbacks to chunked prefill, due to speculative mode is enabled and not implemented with hybrid prefill yet.
Traceback (most recent call last):
File "C:\Users\Administrator\Desktop\mlc_qs.py", line 667, in
test_engine_basic("dist/Qwen1.5-0.5B-Chat-q4f16_1-MLC","dist/libs/Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll",['dist\Qwen1.5-0.5B-Chat-q4f16_1-MLC','dist\libs\Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll'])
File "C:\Users\Administrator\Desktop\mlc_qs.py", line 121, in test_engine_basic
engine.step()
File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\mlc_llm\serve\sync_engine.py", line 351, in step
self._ffi"step"
File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\tvm_ffi_ctypes\packed_func.py", line 245, in call
raise_last_ffi_error()
File "C:\Users\Administrator\miniconda3\envs\mlc-chat-env\Lib\site-packages\tvm_ffi\base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
File "D:\a\package\package\mlc-llm\cpp\serve\engine_actions\batch_draft.cc", line 151
InternalError: Check failed: (!mstates[i]->draft_output_tokens.empty()) is false:
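
For reference, judging from the test_engine_basic call in the traceback, mlc_qs.py passes the 0.5B model as both the target model and the additional (draft) model. A rough sketch of the equivalent engine setup follows; the body of test_engine_basic and its generation parameters are not shown here, so treat the details as assumptions:

```python
from mlc_llm.serve.sync_engine import EngineConfig, SyncMLCEngine

# Rough reconstruction of the small_draft setup in mlc_qs.py, based on the
# test_engine_basic(...) arguments visible in the traceback above: the 0.5B
# model serves as both the target and the draft model.
model = "dist/Qwen1.5-0.5B-Chat-q4f16_1-MLC"
model_lib = "dist/libs/Qwen1.5-0.5B-Chat-q4f16_1-vulkan.dll"

engine = SyncMLCEngine(
    model=model,
    model_lib=model_lib,
    mode="server",  # the log above reports: The actual engine mode is "server"
    engine_config=EngineConfig(
        additional_models=[(model, model_lib)],
        spec_draft_length=5,   # assumed to match mlc_spec.py
        max_num_sequence=7,    # the log reports max batch size 7 specified by user
        speculative_mode="small_draft",
    ),
)
# Per the traceback, engine.step() then fails inside batch_draft.cc with
# InternalError: Check failed: (!mstates[i]->draft_output_tokens.empty()) is false
```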

SkyHeroesS added the bug (Confirmed bugs) label on Nov 4, 2024