Dev refactor xccl primitive #10613

Draft: wants to merge 12 commits into master

Flowingsun007 (Contributor) commented on Jan 3, 2025

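  • Continue decoupling OneFlow from its deep binding to CUDA NCCL by refactoring EagerCclCommMgr, ccl::Comm, and related modules, so that kernels can use device-agnostic (primitive-like) ccl communication calls instead of calling NCCL APIs directly. This prepares the ground for multi-device compatibility.
  • When support for different devices (cuda/npu/xpu, etc.) is added later, kernels and any other code that calls communication APIs should, in principle, not call device-coupled communication APIs such as NCCL directly. They should instead use the base-class APIs oneflow::ccl::Send/Recv/AllReduce/... (located under the oneflow/user/kernels/collective_communication/include directory), with a subclass implementation provided for each device.
  • Each device then needs to inherit from the oneflow::ccl communication APIs and implement its own subclass communication APIs.
    • For example, the CUDA device needs to implement oneflow::ccl::CudaSend/CudaRecv/CudaAllReduce, etc. on top of the NCCL API.
    • The NPU device needs to implement oneflow::ccl::NpuSend/NpuRecv/NpuAllReduce, etc. on top of the HCCL API.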
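
The base-class/subclass split described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the `sketch::ccl` namespace, the `DeviceType` enum, the `Launch` signature, `CpuAllReduce`, and the registry helpers (`RegisterAllReduce`, `NewAllReduce`) are all hypothetical names chosen for the example. The in-memory `CpuAllReduce` only stands in for a real backend; a CUDA subclass would wrap NCCL and an NPU subclass would wrap HCCL at the same point.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

namespace sketch {
namespace ccl {

// Hypothetical device tag; OneFlow's real device-type enum is not shown in this PR text.
enum class DeviceType { kCPU, kCUDA, kNPU };

// Device-agnostic interface that kernels would depend on.
class AllReduce {
 public:
  virtual ~AllReduce() = default;
  // Sums the per-rank buffers element-wise and writes the sum back to every rank.
  virtual void Launch(std::vector<std::vector<float>>& rank_buffers) const = 0;
};

// Stand-in backend that reduces in host memory.
// A CUDA subclass would call NCCL here; an NPU subclass would call HCCL.
class CpuAllReduce final : public AllReduce {
 public:
  void Launch(std::vector<std::vector<float>>& rank_buffers) const override {
    if (rank_buffers.empty()) { return; }
    const std::size_t n = rank_buffers.front().size();
    std::vector<float> sum(n, 0.0f);
    for (const auto& buf : rank_buffers) {
      assert(buf.size() == n);  // all ranks must contribute equally sized buffers
      for (std::size_t i = 0; i < n; ++i) { sum[i] += buf[i]; }
    }
    for (auto& buf : rank_buffers) { buf = sum; }
  }
};

using AllReduceFactory = std::function<std::unique_ptr<AllReduce>()>;

// Hypothetical registry mapping a device type to its subclass factory.
std::map<DeviceType, AllReduceFactory>& Registry() {
  static std::map<DeviceType, AllReduceFactory> registry;
  return registry;
}

void RegisterAllReduce(DeviceType type, AllReduceFactory factory) {
  Registry()[type] = std::move(factory);
}

std::unique_ptr<AllReduce> NewAllReduce(DeviceType type) {
  auto it = Registry().find(type);
  if (it == Registry().end()) {
    throw std::runtime_error("no AllReduce registered for this device type");
  }
  return it->second();
}

}  // namespace ccl
}  // namespace sketch

// Kernel-side code: selects the backend by device type and never names NCCL or HCCL.
std::vector<std::vector<float>> DemoAllReduceSum() {
  using namespace sketch::ccl;
  RegisterAllReduce(DeviceType::kCPU,
                    [] { return std::unique_ptr<AllReduce>(new CpuAllReduce()); });
  std::vector<std::vector<float>> buffers = {{1.0f, 2.0f}, {10.0f, 20.0f}};
  NewAllReduce(DeviceType::kCPU)->Launch(buffers);
  return buffers;
}
```

With this shape, adding a new device means registering one more subclass; the calling code in `DemoAllReduceSum` stays unchanged apart from the device type it asks for.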
