Release 0.3.4
Features:
- Flash checkpoint enables saving and loading Megatron-LM models from multiple ranks in parallel.
- dlrover-run --auto-config: Automatically configure the number of nodes and the number of processes per node.
- Users can customize the storage APIs to save checkpoints to different file systems.
- A deletion strategy cleans up old checkpoint files.
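
To illustrate what a checkpoint deletion strategy does, here is a minimal sketch of a keep-latest-N policy. It is not the DLRover implementation: the function name `clean_old_checkpoints`, the `max_to_keep` parameter, and the `checkpoint-<step>` directory layout are all assumptions made for this example.

```python
import os
import re
import shutil


def clean_old_checkpoints(root, max_to_keep):
    """Hypothetical helper (not a DLRover API): keep the newest checkpoints.

    Assumes each checkpoint lives in a directory named "checkpoint-<step>"
    directly under `root`. Returns the list of steps that were removed.
    """
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = []
    for name in os.listdir(root):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(root, name)):
            steps.append(int(match.group(1)))
    steps.sort()
    # Everything except the last `max_to_keep` steps is considered stale.
    stale = steps[:-max_to_keep] if max_to_keep > 0 else steps
    for step in stale:
        shutil.rmtree(os.path.join(root, f"checkpoint-{step}"))
    return stale
```

Calling `clean_old_checkpoints("/path/to/ckpts", 2)` after each save would leave only the two most recent step directories on disk. The path here is a placeholder.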
BugFix:
- Fixed an error where the shared memory did not exist after the size of the checkpoint changed.