[WIP] Port to ROCm/HIP #178

Open. Wants to merge 5 commits into base: main.
Conversation

@fxzjshm commented Feb 12, 2025

An attempt to port the code to the AMD ROCm platform.

GPTQ_marlin is temporarily disabled, waiting to be ported by experts.

The compat layer is from llama.cpp; it also contains some MUSA-related port code, so @MooreThreads may be interested.

Related: #159

Credit: @leavelet

Tested on 1 * Radeon 7900XTX 48GB + 1 * EPYC 9654 (12 * DDR5-4800) with PyTorch 2.6.0+rocm6.2.4, DeepSeek-R1-Q4_K_M:

Performance (T/s): prefill 4.693228619234502, decode 4.563661526344222. Time (s): tokenize 0.0442044734954834, prefill 1.278437614440918, decode 18.187150716781616

Tested on 1 * Radeon 7900XTX 48GB + 2 * EPYC 9174F (but only (4 + 3) * DDR5-4800) with PyTorch 2.6.0+rocm6.2.4, DeepSeek-R1-Q4_K_M:

Performance (T/s): prefill 1.3510978223059054, decode 1.354754309072004. Time (s): tokenize 0.1658039093017578, prefill 5.9211108684539795, decode 102.6016297340393

Not working yet with 2 * 7900XTX 48GB (the model loads, but there is a segmentation fault in hipGraph):

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007fffa8b2e53b in hip::GraphKernelNode::copyParams(hipKernelNodeParams const*) ()                        
   from /home/user/venv/torch-venv/lib/python3.12/site-packages/torch/lib/libamdhip64.so                   
(gdb) bt                                                                                                                                                               
#0  0x00007fffa8b2e53b in hip::GraphKernelNode::copyParams(hipKernelNodeParams const*) ()
   from /home/user/venv/torch-venv/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#1  0x00007fffa8b53954 in hip::GraphKernelNode::clone() const () from /home/user/venv/torch-venv/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#2  0x00007fffa8ade400 in hip::Graph::clone(std::unordered_map<hip::GraphNode*, hip::GraphNode*, std::hash<hip::GraphNode*>, std::equal_to<hip::GraphNode*>, std::alloc
ator<std::pair<hip::GraphNode* const, hip::GraphNode*> > >&) const () from /home/user/venv/torch-venv/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#3  0x00007fffa8b28dd0 in hip::ihipGraphInstantiate(hip::GraphExec**, hip::Graph*, unsigned long) ()                                
   from /home/user/venv/torch-venv/lib/python3.12/site-packages/torch/lib/libamdhip64.so                              
#4  0x00007fffa8b2a20a in hip::hipGraphInstantiateWithFlags(hipGraphExec**, ihipGraph*, unsigned long long) ()
   from /home/user/venv/torch-venv/lib/python3.12/site-packages/torch/lib/libamdhip64.so    
#5  0x00007fffd5e63107 in at::cuda::CUDAGraph::capture_end() () from /home/user/venv/torch-venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so
#6  0x00007ffff47c6772 in torch::detail::wrap_pybind_function_impl_<void (at::cuda::CUDAGraph::*)(), 0ul, true>(void (at::cuda::CUDAGraph::*&&)(), std::integer_sequenc
e<unsigned long, 0ul>, std::integral_constant<bool, true>)::{lambda(at::cuda::CUDAGraph&)#1}::operator()(at::cuda::CUDAGraph&) const ()
   from /home/user/venv/torch-venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so
......

Publishing this here in case it can help someone.

@yeahdongcn (Contributor)

I will look into this on the MUSA side.

@fxzjshm (Author) commented Feb 14, 2025

If flash_attention cannot be installed, you may comment out the code related to flash_attention as a workaround:

diff --git a/ktransformers/operators/models.py b/ktransformers/operators/models.py
index 5d2e911..7af4e0d 100644
--- a/ktransformers/operators/models.py
+++ b/ktransformers/operators/models.py
@@ -19,7 +19,7 @@ import torch.nn.functional as F
 import torch.utils.checkpoint
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
-from ktransformers.operators.dynamic_attention import DynamicScaledDotProductAttention
+# from ktransformers.operators.dynamic_attention import DynamicScaledDotProductAttention
 from ktransformers.server.config.config import Config
 import os
 import yaml
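
A less invasive alternative, sketched below under the assumption that the import path is the one shown in the diff, is to guard the import so the module degrades gracefully when flash_attention (or its dependencies) is unavailable on ROCm:

# Hypothetical sketch, not part of this PR: guard the optional import instead of
# commenting it out, so the rest of models.py keeps working without flash_attention.
try:
    from ktransformers.operators.dynamic_attention import DynamicScaledDotProductAttention
    HAS_DYNAMIC_ATTENTION = True
except ImportError:
    DynamicScaledDotProductAttention = None  # callers must check HAS_DYNAMIC_ATTENTION first
    HAS_DYNAMIC_ATTENTION = False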

We do not have good enough hardware here to benchmark for now; we can only confirm that it runs on a single AMD card (the default setup requires more than 32 GB of VRAM), with the rate shown in the previous comment. Moore Threads cannot be tested for the same reason, so no further code modifications were made.

For an MI100 with 32 GB of VRAM, the following config can be used to split the model; users with other VRAM sizes may adjust the number of layers accordingly (a quick regex check for the layer ranges is sketched after the config).

DeepSeek-V3-Chat-mixed.yaml

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
        generate_device: "cpu"
        prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[12345][0-9])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3456][0-9])\\."
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3RotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[12345][0-9])\\.(?!self_attn\\.kv_b_proj).*$"  # regular expression 
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearTorch"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([3456][0-9])\\.(?!self_attn\\.kv_b_proj).*$"  # regular expression 
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized Kernel on quantized data types
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
      generate_op: "KLinearCPUInfer"
      prefill_op: "KLinearCPUInfer"
  
- match:
    name: "^model\\.layers\\.(0|[1-9]|[12345][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE     # mlp module with custom forward function
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3456][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek_v3.DeepseekV3MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV3MoE     # mlp module with custom forward function
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[12345][0-9])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3456][0-9])\\.mlp\\.gate$"
    class: ktransformers.models.modeling_deepseek_v3.MoEGate
  replace:
    class: ktransformers.operators.gate.KMoEGate     # mlp module with custom forward function
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[12345][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op:  "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module

- match:
    name: "^model\\.layers\\.([3456][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE Kernel with expert parallelism
    kwargs:
      prefill_device: "cpu"
      prefill_op: "KExpertsCPU"
      generate_device: "cpu"
      generate_op:  "KExpertsCPU"
      out_device: "cpu"
  recursive: False # don't recursively inject submodules of this module

- match:
    name: "^model\\.layers\\.(0|[1-9]|[12345][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3456][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention # optimized MLA implementation
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0 # 0 disables layer-wise prefill
      transfer_map: 
        60: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[12345][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "(^model\\.layers\\.([3456][0-9])\\.)|(model.norm)|(lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
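
When adjusting the split for a different VRAM size, a quick way to see which decoder layers each rule's regex actually covers is the sketch below (the helper is hypothetical, and the layer count of 61, i.e. model.layers.0 through model.layers.60 for DeepSeek-V3/R1, is an assumption here):

import re

# Hypothetical helper, not part of this PR: list the layer indices a match rule covers.
gpu_rule = r"^model\.layers\.(0|[1-9]|[12345][0-9])\."
cpu_rule = r"^model\.layers\.([3456][0-9])\."

def covered_layers(pattern, num_layers=61):
    return [i for i in range(num_layers) if re.match(pattern, f"model.layers.{i}.")]

print("GPU rule covers:", covered_layers(gpu_rule))  # layers 0-59
print("CPU rule covers:", covered_layers(cpu_rule))  # layers 30-60; note the overlap with the GPU rule on 30-59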

No leads yet on the hipGraph crash with multiple cards.

@kingofotaku @carsonfly could you help test this?

fxzjshm marked this pull request as ready for review on February 14, 2025 at 17:09
@chf2000 commented Feb 15, 2025

Not enough VRAM to run on a single Radeon 7900XTX 24GB without GPTQ_marlin, and it crashed on 2 * Radeon 7900XTX 24GB.

@Atream (Contributor) commented Feb 15, 2025

Does it support a ROCm feature similar to CUDA Graph? Without a compute graph, the decode process can get stuck in Python, resulting in a loss of nearly 50%.

@yeahdongcn (Contributor)

Is there any chance to update the llama.cpp submodule? Its commit is 7 months old, and there is no MUSA-compatible code there.

@fxzjshm (Author) commented Feb 17, 2025

Does it support a ROCm feature similar to CUDA Graph? Without a compute graph, the decode process can get stuck in Python, resulting in a loss of nearly 50%.

Note that the multi-GPU crash happens inside the hipGraph module, which may indicate that hipGraph is already enabled and may have an issue there.
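
To narrow this down, a minimal check of whether hipGraph capture and replay work at all on a given setup could look like the sketch below (an assumption-laden sketch: it presumes a ROCm build of PyTorch, where the torch.cuda graph API maps to HIP graphs):

import torch

# Minimal hipGraph smoke test: capture a trivial kernel and replay it once.
x = torch.zeros(8, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    x += 1  # recorded during capture, not executed
g.replay()
torch.cuda.synchronize()
print(x)  # all ones if capture and replay succeeded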

Is there any chance to update the llama.cpp submodule? Its commit is 7 months old, and there is no MUSA-compatible code there.

You would probably have to ask the original authors.

@citrix123 commented Feb 18, 2025

Hi @fxzjshm

I am trying the following on a single W7900 and on 2 x W7900:

python -m ktransformers.local_chat  --model_path deepseek-ai/DeepSeek-R1 --gguf_path /ML/models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M --port 10002 --optimize_config_path /ML/src/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml

But I get the error below:

Chat: Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/py_3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/ML/ext/external/ML/src/ktransformers/ktransformers/local_chat.py", line 179, in <module>
    fire.Fire(local_chat)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/ML/ext/external/ML/src/ktransformers/ktransformers/local_chat.py", line 173, in local_chat
    generated = prefill_and_generate(
  File "/ML/ext/external/ML/src/ktransformers/ktransformers/util/utils.py", line 150, in prefill_and_generate
    logits = model(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ML/ext/external/ML/src/ktransformers/ktransformers/models/modeling_deepseek_v3.py", line 1688, in forward
    outputs = self.model(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ML/ext/external/ML/src/ktransformers/ktransformers/operators/models.py", line 722, in forward
    layer_outputs = decoder_layer(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ML/ext/external/ML/src/ktransformers/ktransformers/models/modeling_deepseek_v3.py", line 1205, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ML/ext/external/ML/src/ktransformers/ktransformers/operators/attention.py", line 372, in forward
    return self.forward_chunck(
  File "/ML/ext/external/ML/src/ktransformers/ktransformers/operators/attention.py", line 273, in forward_chunck
    q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ML/ext/external/ML/src/ktransformers/ktransformers/operators/linear.py", line 410, in forward
    return self.generate_linear.forward(x)
  File "/ML/ext/external/ML/src/ktransformers/ktransformers/operators/linear.py", line 227, in forward
    x = KTransformersOps.gptq_marlin_gemm(
NotImplementedError: marlin_gemm(..) requires CUDA_ARCH >= 8.

Do you know what I am doing wrong?

@fxzjshm (Author) commented Feb 18, 2025

@citrix123 Since GPTQ_marlin is currently not implemented for AMD GPUs, you need the same workaround as in #150 (comment) and #138 (comment).
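
After editing a rules file, a quick way to confirm that no rule still selects a Marlin kernel is to scan the YAML, as in the hypothetical sketch below (the relative path is illustrative; use whatever file you pass to --optimize_config_path):

import yaml

# Hypothetical check, not part of this PR: report rules whose generate_op/prefill_op still use Marlin.
path = "ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml"
with open(path) as f:
    rules = yaml.safe_load(f)

for rule in rules:
    kwargs = (rule.get("replace") or {}).get("kwargs") or {}
    for key in ("generate_op", "prefill_op"):
        if "Marlin" in str(kwargs.get(key, "")):
            print(rule.get("match"), key, kwargs[key])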

@citrix123

Thanks for the quick response @fxzjshm

It seems I am still seeing the same issue. Below is the change I have made:

diff --git a/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml b/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml
index 06ab4db..fa082dc 100644
--- a/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml
+++ b/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml
@@ -31,7 +31,7 @@
     kwargs:
       generate_device: "cuda:0"
       prefill_device: "cuda:0"
-      generate_op: "KLinearMarlin"
+      generate_op: "KLinearTorch"
       prefill_op: "KLinearTorch"

 - match:
@@ -42,7 +42,7 @@
     kwargs:
       generate_device: "cuda:1"
       prefill_device: "cuda:1"
-      generate_op: "KLinearMarlin"
+      generate_op: "KLinearTorch"
       prefill_op: "KLinearTorch"

I have even tried rebuilding with install.sh and the command:

python -m ktransformers.local_chat  --model_path deepseek-ai/DeepSeek-R1 --gguf_path /ML/models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M --port 10002 --optimize_config_path /ML/src/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml

Is there something I misunderstood?

@citrix123 commented Feb 18, 2025

Thanks, I can reproduce the issue you mentioned.
Not sure why local_chat didn't work for me :) Have you used the same command to test on your side?

ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /ML/models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M --port 10002 --optimize_config_path /ML/src/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml
Backtrace
#0  0x00007a93b8e76df8 in ?? () from /opt/rocm/lib/libamdhip64.so.6
#1  0x00007a93b8ea697f in ?? () from /opt/rocm/lib/libamdhip64.so.6
#2  0x00007a93b8e6a876 in ?? () from /opt/rocm/lib/libamdhip64.so.6
#3  0x00007a93b8ce58a0 in ?? () from /opt/rocm/lib/libamdhip64.so.6
#4  0x00007a93b8ce9103 in ?? () from /opt/rocm/lib/libamdhip64.so.6
#5  0x00007a93b8ceb2ee in ?? () from /opt/rocm/lib/libamdhip64.so.6
#6  0x00007a93e30bc0ed in at::native::copy_device_to_device(at::TensorIterator&, bool, bool) () from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
#7  0x00007a93f09122e9 in at::native::copy_impl(at::Tensor&, at::Tensor const&, bool) [clone .isra.0] () from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007a93f0913c50 in at::native::copy_(at::Tensor&, at::Tensor const&, bool) () from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007a93f16d4b98 in at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) () from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007a93f0c0e772 in at::native::_to_copy(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) ()
   from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007a93f1aab4f1 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd___to_copy>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) () from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007a93f1114fcc in at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) () from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#13 0x00007a93f18d3d54 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>), &at::(anonymous namespace)::_to_copy>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) () from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#14 0x00007a93f1114fcc in at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) () from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#15 0x00007a93f385f428 in torch::autograd::VariableType::(anonymous namespace)::_to_copy(c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) () from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#16 0x00007a93f385f8b4 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>), &torch::autograd::VariableType::(anonymous namespace)::_to_copy>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat> > >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) () from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#17 0x00007a93f11ce626 in at::_ops::_to_copy::call(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) ()
   from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#18 0x00007a93f0c0b7c4 in at::native::to(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, bool, std::optional<c10::MemoryFormat>) ()
   from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#19 0x00007a93f1cac6d7 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, bool, std::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd_dtype_layout_to>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, bool, std::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, bool, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, bool, std::optional<c10::MemoryFormat>) () from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007a93f139076b in at::_ops::to_dtype_layout::call(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, bool, std::optional<c10::MemoryFormat>) ()
   from /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so

@fxzjshm (Author) commented Feb 19, 2025

@citrix123 Yes, the same command for the multi-GPU case.


We managed to get access to an EPYC 9654 platform; the results are updated in the first post.
