Release 0.2.2
ElasticJob
Features:
- `dlrover-run` can run any distributed job that sets `NODE_RANK` and `DLROVER_MASTER_ADDR` in the environment.
- DLRover can asynchronously save checkpoints to storage, blocking training for only a short time.
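
The general idea behind asynchronous checkpointing can be sketched as follows. This is a minimal illustration that uses only the Python standard library, not DLRover's actual implementation or API: the `AsyncCheckpointer` class, its methods, and the file naming are hypothetical. Training blocks only while the state is copied in memory; the slower write to storage runs in a background thread.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper for illustration; not part of DLRover's API."""

    def __init__(self, directory):
        self.directory = directory
        self._writer = None

    def save(self, step, state):
        # Blocking part: wait for any previous write, then snapshot in memory.
        self.wait()
        snapshot = copy.deepcopy(state)
        path = os.path.join(self.directory, f"checkpoint-{step}.pkl")
        # Non-blocking part: persist the snapshot in a background thread.
        self._writer = threading.Thread(target=self._write, args=(path, snapshot))
        self._writer.start()
        return path

    @staticmethod
    def _write(path, snapshot):
        # Write to a temporary file first so a crash never leaves a partial checkpoint.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)

    def wait(self):
        # Block until the pending write, if any, has finished.
        if self._writer is not None:
            self._writer.join()
            self._writer = None


def demo():
    state = {"step": 0, "weights": [0.0, 0.0]}
    with tempfile.TemporaryDirectory() as directory:
        checkpointer = AsyncCheckpointer(directory)
        path = checkpointer.save(step=0, state=state)
        # Training continues and mutates the state while the write is in flight.
        state["step"] = 1
        state["weights"][0] = 1.0
        checkpointer.wait()
        with open(path, "rb") as f:
            return pickle.load(f)
```

Because the snapshot is taken before the background write starts, later updates to the training state do not leak into the saved checkpoint. The cost is the extra memory needed to hold one copy of the state.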
BugFix:
- Fix a bug in loading FSDP checkpoints.