Release 0.3.4

@workingloong released this 21 Feb 07:10
· 636 commits to master since this release

Features:

  • Flash checkpoint enables saving and loading Megatron-LM models from multiple ranks in parallel.
  • dlrover-run --auto-config automatically configures the number of nodes and the number of processes per node.
  • Users can customize the storage APIs to save checkpoints to different file systems.
  • A deletion strategy cleans up old checkpoint files.
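
To illustrate how a customizable storage API and a deletion strategy can fit together, here is a minimal Python sketch. It is not DLRover's actual API: the names `StorageBackend`, `LocalDiskStorage`, `KeepLatestDeletionStrategy`, `remove_dir`, `clean_up`, and the one-directory-per-step layout are all assumptions made for this example.

```python
import os
import shutil
import tempfile
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical storage interface; each subclass targets one file system."""

    @abstractmethod
    def write(self, content: bytes, path: str) -> None:
        """Persist `content` at `path`."""

    @abstractmethod
    def remove_dir(self, path: str) -> None:
        """Delete the directory at `path` and everything under it."""


class LocalDiskStorage(StorageBackend):
    """Backend for a local or mounted POSIX file system."""

    def write(self, content: bytes, path: str) -> None:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(content)

    def remove_dir(self, path: str) -> None:
        shutil.rmtree(path, ignore_errors=True)


class KeepLatestDeletionStrategy:
    """Hypothetical strategy: keep only the newest `max_to_keep` step directories."""

    def __init__(self, max_to_keep: int, checkpoint_dir: str, storage: StorageBackend):
        if max_to_keep < 1:
            raise ValueError("max_to_keep must be at least 1")
        self.max_to_keep = max_to_keep
        self.checkpoint_dir = checkpoint_dir
        self.storage = storage
        self._steps = []  # steps recorded so far, kept in ascending order

    def clean_up(self, step: int) -> list:
        """Record a newly saved step and delete the oldest surplus directories.

        Returns the steps whose directories were removed.
        """
        self._steps.append(step)
        self._steps.sort()
        removed = []
        while len(self._steps) > self.max_to_keep:
            old_step = self._steps.pop(0)
            self.storage.remove_dir(os.path.join(self.checkpoint_dir, str(old_step)))
            removed.append(old_step)
        return removed


def demo() -> list:
    """Save four fake checkpoints and return the step directories that survive."""
    root = tempfile.mkdtemp()
    storage = LocalDiskStorage()
    strategy = KeepLatestDeletionStrategy(max_to_keep=2, checkpoint_dir=root, storage=storage)
    for step in (100, 200, 300, 400):
        storage.write(b"fake weights", os.path.join(root, str(step), "model.pt"))
        strategy.clean_up(step)
    remaining = sorted(os.listdir(root), key=int)
    shutil.rmtree(root)
    return remaining
```

In this sketch, `demo()` leaves only the two most recent step directories. Because the strategy deletes through the storage object rather than calling the file system directly, supporting another file system would only require a different `StorageBackend` subclass.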

Bug fixes:

  • Fixed an error where the shared memory did not exist after the size of the checkpoint changed.