Error executing method determine_num_available_blocks: vLLM multi-node fails for both DeepSeek-Coder-V2-Instruct and DeepSeek-Coder-V2-Lite-Instruct #76

liangfang opened this issue Jul 28, 2024 · 1 comment

@liangfang

First, I'd like to ask whether DeepSeek has ever tried running this on vLLM multi-node?
I am running it in half precision (float16) via Ray on 2 nodes x 8 V100 GPUs.

These are the launch arguments:

CUDA_LAUNCH_BLOCKING=1 OMP_NUM_THREADS=1 vllm serve deepseek-ai/DeepSeek-Coder-V2-Instruct --tensor-parallel-size 16 --dtype half --trust-remote-code --enforce-eager --enable-chunked-prefill=False
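
For reference, the Ray cluster is brought up across the two nodes before launching the server, roughly as follows (a sketch only; exact flags may differ, and 10.0.128.17 / 10.0.128.18 are the head and worker addresses visible in the log below):

# On the head node (10.0.128.17):
ray start --head --port=6379
# On the second node (10.0.128.18), joining the head:
ray start --address=10.0.128.17:6379
# The vllm serve command above is then run on the head node.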

DeepSeek-Coder-V2-Lite-Instruct also fails at determine_num_available_blocks, but with an NCCL error instead:

(RayWorkerWrapper pid=23558, ip=10.0.128.18) ERROR 07-28 13:53:40 worker_base.py:382] RuntimeError: NCCL Error 3: internal error - please report this issue to the NCCL developers
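
One way to narrow down the NCCL Error 3 might be to rerun the Lite model with verbose NCCL logging and TCP-only transport; this is only a debugging sketch, and the interface name is a placeholder that must match the NIC connecting the two nodes:

# Standard NCCL debugging knobs (not vLLM-specific):
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
export NCCL_SOCKET_IFNAME=eth0    # placeholder; use the actual inter-node interface
export NCCL_IB_DISABLE=1          # force TCP sockets if InfiniBand is not set up between the nodes
vllm serve deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --tensor-parallel-size 16 --dtype half --trust-remote-code --enforce-eager --enable-chunked-prefill=False

The full log of the DeepSeek-Coder-V2-Instruct run follows: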

root@g02-17:/vllm-workspace# PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True CUDA_LAUNCH_BLOCKING=1  OMP_NUM_THREADS=1 HF_ENDPOINT="https://hf-mirror.com" vllm serve deepseek-ai/DeepSeek-Coder-V2-Instruct --tensor-parallel-size 16 --dtype half --trust-remote-code --enforce-eager --enable-chunked-prefill=False --max-model-len 8192
INFO 07-28 13:54:35 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 07-28 13:54:35 api_server.py:220] args: Namespace(model_tag='deepseek-ai/DeepSeek-Coder-V2-Instruct', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='deepseek-ai/DeepSeek-Coder-V2-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=16, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fa6bae68d30>)
WARNING 07-28 13:54:37 config.py:1425] Casting torch.bfloat16 to torch.float16.
INFO 07-28 13:54:37 config.py:715] Defaulting to use ray for distributed inference
2024-07-28 13:54:37,131	INFO worker.py:1603 -- Connecting to existing Ray cluster at address: 10.0.128.17:6379...
2024-07-28 13:54:37,140	INFO worker.py:1788 -- Connected to Ray cluster.
INFO 07-28 13:54:38 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='deepseek-ai/DeepSeek-Coder-V2-Instruct', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-Coder-V2-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=16, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=deepseek-ai/DeepSeek-Coder-V2-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-28 13:55:28 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-28 13:55:28 selector.py:54] Using XFormers backend.
(RayWorkerWrapper pid=24600, ip=10.0.128.18) INFO 07-28 13:55:28 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=24600, ip=10.0.128.18) INFO 07-28 13:55:28 selector.py:54] Using XFormers backend.
[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
INFO 07-28 13:55:32 utils.py:784] Found nccl from library libnccl.so.2
INFO 07-28 13:55:32 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=77233) INFO 07-28 13:55:32 utils.py:784] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=77233) INFO 07-28 13:55:32 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 07-28 13:55:33 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes.
INFO 07-28 13:55:33 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='10.0.128.17', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fa3914c32b0>, local_subscribe_port=59601, local_sync_port=35331, remote_subscribe_port=53347, remote_sync_port=58569)
(RayWorkerWrapper pid=24524, ip=10.0.128.18) WARNING 07-28 13:55:33 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes.
INFO 07-28 13:55:33 model_runner.py:680] Starting to load model deepseek-ai/DeepSeek-Coder-V2-Instruct...
(RayWorkerWrapper pid=24524, ip=10.0.128.18) INFO 07-28 13:55:33 model_runner.py:680] Starting to load model deepseek-ai/DeepSeek-Coder-V2-Instruct...
(RayWorkerWrapper pid=25062, ip=10.0.128.18) INFO 07-28 13:55:28 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs. [repeated 14x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=25062, ip=10.0.128.18) INFO 07-28 13:55:28 selector.py:54] Using XFormers backend. [repeated 14x across cluster]
(RayWorkerWrapper pid=24524, ip=10.0.128.18) Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
INFO 07-28 13:55:33 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-28 13:55:33 selector.py:54] Using XFormers backend.
INFO 07-28 13:55:34 weight_utils.py:223] Using model weights format ['*.safetensors']
(RayWorkerWrapper pid=24600, ip=10.0.128.18) INFO 07-28 13:55:35 weight_utils.py:223] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/55 [00:00<?, ?it/s]
...
(RayWorkerWrapper pid=77774) INFO 07-28 13:56:37 model_runner.py:692] Loading model weights took 28.7795 GB
(RayWorkerWrapper pid=25062, ip=10.0.128.18) INFO 07-28 13:55:32 utils.py:784] Found nccl from library libnccl.so.2 [repeated 14x across cluster]
(RayWorkerWrapper pid=25062, ip=10.0.128.18) INFO 07-28 13:55:32 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) WARNING 07-28 13:55:33 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes. [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) INFO 07-28 13:55:33 model_runner.py:680] Starting to load model deepseek-ai/DeepSeek-Coder-V2-Instruct... [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) INFO 07-28 13:55:33 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs. [repeated 15x across cluster]
(RayWorkerWrapper pid=77774) INFO 07-28 13:55:33 selector.py:54] Using XFormers backend. [repeated 15x across cluster]
(RayWorkerWrapper pid=77774) Cache shape torch.Size([163840, 64]) [repeated 14x across cluster]
(RayWorkerWrapper pid=77389) INFO 07-28 13:55:35 weight_utils.py:223] Using model weights format ['*.safetensors'] [repeated 14x across cluster]
Loading safetensors checkpoint shards: 100% Completed | 55/55 [01:02<00:00,  1.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 55/55 [01:02<00:00,  1.14s/it]

INFO 07-28 13:56:38 model_runner.py:692] Loading model weights took 28.6821 GB
(RayWorkerWrapper pid=77233) INFO 07-28 13:56:42 model_runner.py:692] Loading model weights took 28.7795 GB [repeated 5x across cluster]
(RayWorkerWrapper pid=24677, ip=10.0.128.18) INFO 07-28 13:57:02 model_runner.py:692] Loading model weights took 28.7795 GB [repeated 2x across cluster]
ERROR 07-28 13:57:06 worker_base.py:382] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
ERROR 07-28 13:57:06 worker_base.py:382] Traceback (most recent call last):
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
ERROR 07-28 13:57:06 worker_base.py:382]     return executor(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-28 13:57:06 worker_base.py:382]     return func(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
ERROR 07-28 13:57:06 worker_base.py:382]     self.model_runner.profile_run()
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-28 13:57:06 worker_base.py:382]     return func(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
ERROR 07-28 13:57:06 worker_base.py:382]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-28 13:57:06 worker_base.py:382]     return func(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1314, in execute_model
ERROR 07-28 13:57:06 worker_base.py:382]     hidden_or_intermediate_states = model_executable(
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return self._call_impl(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return forward_call(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 454, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return self._call_impl(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return forward_call(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 421, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     hidden_states, residual = layer(positions, hidden_states,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return self._call_impl(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return forward_call(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 379, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     hidden_states = self.mlp(hidden_states)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return self._call_impl(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return forward_call(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 139, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     final_hidden_states = self.experts(
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return self._call_impl(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return forward_call(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 250, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     final_hidden_states = self.quant_method.apply(
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 75, in apply
ERROR 07-28 13:57:06 worker_base.py:382]     return self.forward(x, layer.w13_weight, layer.w2_weight,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/custom_op.py", line 13, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     return self._forward_method(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 92, in forward_cuda
ERROR 07-28 13:57:06 worker_base.py:382]     return fused_moe(x,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 613, in fused_moe
ERROR 07-28 13:57:06 worker_base.py:382]     return fused_experts(hidden_states,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 511, in fused_experts
ERROR 07-28 13:57:06 worker_base.py:382]     moe_align_block_size(curr_topk_ids, config['BLOCK_SIZE_M'], E))
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 218, in moe_align_block_size
ERROR 07-28 13:57:06 worker_base.py:382]     ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 34, in wrapper
ERROR 07-28 13:57:06 worker_base.py:382]     return fn(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 395, in moe_align_block_size
ERROR 07-28 13:57:06 worker_base.py:382]     torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854, in __call__
ERROR 07-28 13:57:06 worker_base.py:382]     return self_._op(*args, **(kwargs or {}))
ERROR 07-28 13:57:06 worker_base.py:382] RuntimeError: CUDA error: invalid argument
ERROR 07-28 13:57:06 worker_base.py:382] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
...
(RayWorkerWrapper pid=77774) ERROR 07-28 13:57:06 worker_base.py:382] RuntimeError: CUDA error: invalid argument [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) ERROR 07-28 13:57:06 worker_base.py:382] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) ERROR 07-28 13:57:06 worker_base.py:382] For debugging consider passing CUDA_LAUNCH_BLOCKING=1. [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) ERROR 07-28 13:57:06 worker_base.py:382] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. [repeated 14x across cluster]
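
Since both models fail on V100s and the traceback ends in the fused-MoE path (moe_align_block_size), it may be worth confirming the compute capability each worker actually reports; I am not sure these kernels are commonly exercised on Volta (sm 7.0), so this is only a guess. A quick check, assuming PyTorch is available inside the vLLM container:

# Print the compute capability of every visible GPU; V100 should report (7, 0).
python -c "import torch; print([torch.cuda.get_device_capability(i) for i in range(torch.cuda.device_count())])"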

@onlybeyou

Exactly the same problem here, and I also observe GPU utilization of only about 50%.
