
Release 0.3.1

@workingloong workingloong released this 10 Jan 01:54
· 727 commits to master since this release

Feature:

  • Flash checkpoint can now be used in jobs launched with torchrun or python -m torch.distributed.launch.
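
As a rough illustration of what a training script sees under either launcher, the sketch below reads the process identity that torchrun exports through the RANK, LOCAL_RANK and WORLD_SIZE environment variables (recent PyTorch versions of torch.distributed.launch set the same variables). It does not use any DLRover API. The names `LaunchInfo`, `read_launch_info` and `should_write_checkpoint` are hypothetical helpers, and the "only rank 0 writes" policy is an assumption chosen for illustration, not the behavior of flash checkpoint.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchInfo:
    """Process identity as exported by the launcher (hypothetical helper type)."""
    rank: int        # global rank across all nodes
    local_rank: int  # rank within the current node
    world_size: int  # total number of processes


def read_launch_info(env=None):
    """Read RANK, LOCAL_RANK and WORLD_SIZE, falling back to single-process defaults."""
    env = os.environ if env is None else env
    return LaunchInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


def should_write_checkpoint(info):
    """Assumed policy for illustration: only global rank 0 persists a checkpoint."""
    return info.rank == 0


if __name__ == "__main__":
    info = read_launch_info()
    print(info, "writes checkpoint:", should_write_checkpoint(info))
```

Run directly with `python`, the script falls back to a single-process identity. Under `torchrun --nproc_per_node=2`, each process reports its own rank.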

Bugfix:

  • Fixed an issue where the dlrover master could not print the error message of the faulty node in a kubeflow/PytorchJob.