Expected behavior
MLC-LLM should load the sharded model across all 4 AMD Instinct MI100 GPUs and start inferring.
The issue occurs only with the XGMI bridge enabled: adding amdgpu.use_xgmi_p2p=0 to the kernel command line in the GRUB config makes the issue stop with no other changes, though this reverts to PCIe P2P only.
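For reference, a minimal sketch of applying that kernel parameter on Ubuntu 22.04, assuming a stock GRUB setup (the existing contents of GRUB_CMDLINE_LINUX_DEFAULT may differ on other systems):

# /etc/default/grub: append the parameter to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.use_xgmi_p2p=0"
# then regenerate the boot config and reboot
sudo update-grub
sudo reboot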
Here is the output when attempting to run with NCCL_DEBUG=INFO: screenlog.txt
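One way to reproduce that NCCL_DEBUG=INFO capture (a sketch only; repro.py is a hypothetical file containing the snippet under Steps to reproduce, and the exact capture method may differ):

# RCCL honours the NCCL_* environment variables, so this dumps transport/P2P setup details
NCCL_DEBUG=INFO python repro.py 2>&1 | tee screenlog.txt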
Actual behavior
/src/extlibs/rccl/build/hipify/src/transport/p2p.cc:287 NCCL WARN Cuda failure 'invalid argument'
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [02:18:19] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:196: rcclErrror: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Stack trace:
0: _ZN3tvm7runtime6deta
1: tvm::runtime::nccl::InitCCLPerWorker(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::AssignTypedLambda<void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>(void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
3: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
4: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
5: 0x00007ff61c0dc252
6: start_thread
at ./nptl/pthread_create.c:442
7: 0x00007ff64cd2665f
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
8: 0xffffffffffffffff
Environment
Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): ROCm 6.0
Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): 4x AMD Instinct MI100
How you installed MLC-LLM (conda, source): conda
How you installed TVM-Unity (pip, source): pip
Python version (e.g. 3.10): 3.10.12
TVM Unity Hash Tag: unity.txt
Steps to reproduce
Install MLC-LLM
Run the following Python code to start loading and inferring:
from mlc_chat import ChatModule, ChatConfig
from mlc_chat.callback import StreamToStdout

# Shard the model across all 4 GPUs via tensor parallelism
cm = ChatModule(model="goliath-120b-q4f16_1", chat_config=ChatConfig(
    max_gen_len=4096,
    conv_template="LM",
    temperature=0.75,
    repetition_penalty=1.1,
    top_p=0.9,
    tensor_parallel_shards=4,
    context_window_size=4096,
))
output = cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=2),
)