Release 0.2.2

@workingloong released this 21 Nov 06:41

ElasticJob

Features:

  • dlrover-run can run in any distributed job that has NODE_RANK and DLROVER_MASTER_ADDR set in the environment.
  • DLRover can save checkpoints to storage asynchronously, so saving blocks training only briefly.
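
The asynchronous checkpoint idea can be illustrated with a minimal, generic sketch. This is not DLRover's implementation or API: the `AsyncCheckpointer` class, its methods, and the `ckpt-<step>.pkl` file naming are hypothetical and use only the Python standard library. The assumption is that training blocks only while the state is copied in memory, and a background thread then performs the slow write to storage.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointer:
    """Hypothetical helper for illustration only, not part of DLRover."""

    def __init__(self, directory):
        self.directory = directory
        self._thread = None

    def _path(self, step):
        return os.path.join(self.directory, f"ckpt-{step}.pkl")

    def save(self, step, state):
        # Allow only one write in flight: wait for the previous one to finish.
        self.wait()
        # Blocking part: take an in-memory snapshot so that later updates
        # to `state` by the training loop do not affect the saved copy.
        snapshot = copy.deepcopy(state)
        # Non-blocking part: write the snapshot to storage in the background.
        self._thread = threading.Thread(target=self._write, args=(step, snapshot))
        self._thread.start()

    def _write(self, step, snapshot):
        path = self._path(step)
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        # Rename after the write completes so readers never see a partial file.
        os.replace(tmp_path, path)

    def wait(self):
        # Block until the pending background write, if any, has finished.
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def load(self, step):
        with open(self._path(step), "rb") as f:
            return pickle.load(f)
```

In a training loop, `save(step, state)` would be called periodically and `wait()` once before exit, so the last checkpoint is fully written. A real system would copy tensors to host or shared memory rather than using `copy.deepcopy` on a Python object.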

BugFix:

  • Fix a bug in loading FSDP checkpoints.