Expected behavior
MLC-LLM should load the sharded model across all 4 AMD Instinct MI100 GPUs and start inferring.
The issue occurs only with the XGMI bridge enabled: adding amdgpu.use_xgmi_p2p=0 to the kernel command line in the GRUB config makes the issue stop with no other changes, though this reverts to PCIe P2P only.
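For reference, a minimal sketch of applying that kernel parameter on Ubuntu 22.04, assuming a stock GRUB setup (the existing contents of GRUB_CMDLINE_LINUX_DEFAULT may differ on other systems):

# /etc/default/grub: append the parameter to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.use_xgmi_p2p=0"
# then regenerate the boot config and reboot
sudo update-grub
sudo reboot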
Here is the output when attempting to run with NCCL_DEBUG=INFO: screenlog.txt
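One way to reproduce that NCCL_DEBUG=INFO capture (a sketch only; repro.py is a hypothetical file containing the snippet under Steps to reproduce, and the exact capture method may differ):

# RCCL honours the NCCL_* environment variables, so this dumps transport/P2P setup details
NCCL_DEBUG=INFO python repro.py 2>&1 | tee screenlog.txt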
Actual behavior
/src/extlibs/rccl/build/hipify/src/transport/p2p.cc:287 NCCL WARN Cuda failure 'invalid argument'
terminate called after throwing an instance of 'tvm::runtime::InternalError'
what(): [02:18:19] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:196: rcclErrror: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Stack trace:
0: _ZN3tvm7runtime6deta
1: tvm::runtime::nccl::InitCCLPerWorker(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::AssignTypedLambda<void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>(void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
3: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
4: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
5: 0x00007ff61c0dc252
6: start_thread
at ./nptl/pthread_create.c:442
7: 0x00007ff64cd2665f
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
8: 0xffffffffffffffff
Environment
Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): ROCm 6.0
Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): 4x AMD Instinct MI100
How you installed MLC-LLM (conda, source): conda
How you installed TVM-Unity (pip, source): pip
Python version (e.g. 3.10): 3.10.12
TVM Unity Hash Tag: unity.txt
Steps to reproduce
Install MLC-LLM
Run the following Python code to start loading and inferring:
from mlc_chat import ChatModule, ChatConfig
from mlc_chat.callback import StreamToStdout

# Shard the model across all 4 GPUs via tensor parallelism
cm = ChatModule(model="goliath-120b-q4f16_1", chat_config=ChatConfig(
    max_gen_len=4096,
    conv_template="LM",
    temperature=0.75,
    repetition_penalty=1.1,
    top_p=0.9,
    tensor_parallel_shards=4,
    context_window_size=4096,
))
output = cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=2),
)