What is the relationship between DLRover and Megatron? Can I integrate DLRover with Megatron to get fault-tolerance and monitoring capabilities? How can DLRover recover from a GPU going offline when TP and PP need to be reorganized?
#1243
Open
dotsonliu opened this issue
Aug 19, 2024
· 1 comment
majieyue
changed the title
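Hello, what is the relationship between dlrover and megatron? Megatron has no fault-tolerance or monitoring features; can it borrow these capabilities from dlrover? How are they integrated? If a GPU suddenly fails and the TP and PP layout changes, how is that handled dynamically?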
What is the relationship between DLRover and Megatron? Can I integrate DLRover with Megatron to get fault-tolerance and monitoring capabilities? How can DLRover recover from a GPU going offline when TP and PP need to be reorganized?
Sep 19, 2024
No description provided.