Release 0.3.1
Feature:
- Users can use flash checkpoint when launching training with torchrun or python -m torch.distributed.launch.
Bugfix:
- Fixed an issue where the dlrover master could not print the error message of the faulty node in a kubeflow/PytorchJob.