
Releases: intelligent-machine-learning/dlrover

Release 0.3.8

29 Sep 02:02

Features:

  • Added the initial implementation of positive diagnostics.
  • Supported a 'fast-fail' strategy for training jobs in some boundary scenarios, e.g. the pending case.
  • Accelerated pod creation (sync -> async).
  • Added the basic implementation of structured event logging.

BugFix:

  • Fixed unexpected rendezvous failures in occasional fault-tolerance scenarios.
  • Fixed unexpected socket client creation before the socket server was created.
  • Optimized the 'network-check' implementation for Ascend NPU.
  • Optimized several implementations for the master-fault-tolerance (internal) scenario.
  • Fixed and optimized numerous other known issues.

Release 0.3.7

13 May 06:02

Features:

  • Flash Checkpoint supports deleting old checkpoints.

BugFix:

  • Save/load the non-parameter variables of the Megatron-LM distributed optimizer.
  • The agent waits for asynchronous checkpoint saving to finish before exiting.

Release 0.3.6

24 Apr 06:18

Features:

  • Flash Checkpoint provides FlashCkptTrainer to support HuggingFace transformers.Trainer (see the sketch after this list).
  • Flash Checkpoint supports loading the Megatron-LM checkpoint from memory.
  • Flash Checkpoint supports saving and loading FSDP checkpoints with a full state dict.
  • The job master can sort node ranks by the access switches of the nodes.
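
A minimal sketch of swapping transformers.Trainer for FlashCkptTrainer, assuming the import path below; only the class name comes from this note, so verify the path and arguments against the dlrover repository.

```python
# Sketch: use FlashCkptTrainer in place of transformers.Trainer so that
# checkpoints go through Flash Checkpoint. The import path is an
# assumption based on this release note; verify it in the dlrover repo.
from transformers import TrainingArguments
from dlrover.trainer.torch.flash_checkpoint.hf_trainer import FlashCkptTrainer  # assumed path

args = TrainingArguments(output_dir="./ckpt", save_steps=100)
trainer = FlashCkptTrainer(
    model=model,                  # placeholder: any transformers model
    args=args,
    train_dataset=train_dataset,  # placeholder: your dataset
)
trainer.train()
```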

BugFix:

  • Fix the segmentation fault when restarting the training process.

Release 0.3.5

29 Mar 07:02

Features:

  • Flash checkpoint supports saving and loading Megatron-LM MOE models. #1042
  • APIs to extend the module that checks nodes with different chips. #1023
  • Automatically mark a node as unschedulable if it fails. #1025

BugFix:

  • Fix the DDP MNIST example to save and load the checkpoint. #1051
  • Fix the checkpoint name of DDP. #1034

Release 0.3.4

21 Feb 07:10

Features:

  • Flash Checkpoint enables saving and loading Megatron-LM models from multiple ranks in parallel.
  • dlrover-run --auto-config automatically configures the number of nodes and the number of processes per node.
  • Users can customize the storage APIs to save checkpoints into different file systems (a hedged sketch follows this list).
  • A deletion strategy cleans up old checkpoint files.
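
As an illustration of the storage extension point, here is a hedged sketch of what a custom backend could look like. The class and method names are assumptions made for illustration only; consult the dlrover source for the real interface.

```python
# Hypothetical sketch of a custom checkpoint storage backend. The class
# and method names are assumed for illustration; dlrover's real
# extension interface may differ.
import os


class PosixDirStorage:
    """Writes checkpoint payloads into a plain POSIX directory."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def write(self, content: bytes, path: str) -> None:
        # Persist one checkpoint file under the storage root.
        with open(os.path.join(self.root, path), "wb") as f:
            f.write(content)

    def read(self, path: str) -> bytes:
        with open(os.path.join(self.root, path), "rb") as f:
            return f.read()
```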

BugFix:

  • Fixed the bug that the shared memory does not exist if the size of the checkpoint changes.

Release 0.3.3

25 Jan 02:28

Features:

  • Support Python > 3.10.
  • Support restarting the training process on Ascend NPU.
  • Support asynchronously saving the checkpoint of the Megatron-LM distributed optimizer to storage (see the sketch below).
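
To the best of my recollection of DLRover's documentation (treat the import path as an assumption to verify), asynchronous Megatron-LM checkpointing is enabled by swapping Megatron's native save/load functions for DLRover's drop-in versions:

```python
# Sketch: asynchronous Megatron-LM checkpointing via Flash Checkpoint.
# Replace Megatron-LM's native functions, e.g.
#   from megatron.checkpointing import save_checkpoint, load_checkpoint
# with DLRover's drop-in versions (import path assumed; verify in repo):
from dlrover.trainer.torch.flash_checkpoint.megatron import (
    save_checkpoint,
    load_checkpoint,
)

# The existing Megatron training loop then persists asynchronously,
# including the distributed optimizer state, e.g.:
#   save_checkpoint(iteration, model, optimizer, opt_param_scheduler)
```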

BugFix:

  • Fix the checkpoint shard inconsistency across all ranks.
  • Fix the bug in asynchronously saving the Megatron-LM checkpoint of a job with multiple GPUs on multiple nodes.
  • Fix the bug in loading the Megatron-LM checkpoint.

Release 0.3.1

10 Jan 01:54

Feature:

  • Users can use Flash Checkpoint with torchrun or python -m torch.distributed.launch (see the sketch below).
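
A minimal sketch of Flash Checkpoint inside a DDP script launched with torchrun. The DdpCheckpointer and StorageType names follow DLRover's examples as I recall them; treat the import path and the save_checkpoint signature as assumptions to verify.

```python
# Sketch: Flash Checkpoint in a DDP script, launched with
#   torchrun --nproc_per_node=8 train.py
# Import path and save_checkpoint signature are assumptions; verify
# them against the dlrover repository.
import torch
import torch.distributed as dist
from dlrover.trainer.torch.flash_checkpoint.ddp import (
    DdpCheckpointer,
    StorageType,
)

dist.init_process_group("nccl")
model = torch.nn.Linear(16, 16).cuda()  # placeholder model

checkpointer = DdpCheckpointer("/tmp/flash_ckpt")
for step in range(1, 101):
    # ... forward/backward/optimizer.step() elided ...
    if step % 50 == 0:
        state = {"model": model.state_dict(), "step": step}
        # Frequent, cheap saves go to shared memory; Flash Checkpoint
        # persists to disk asynchronously.
        checkpointer.save_checkpoint(step, state, storage_type=StorageType.MEMORY)
```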

Bugfix:

  • Fixed the bug that the dlrover master cannot print the error message of the faulty node in a Kubeflow PyTorchJob.

Release 0.3.0

03 Jan 06:54

Features:

  • Flash Checkpoint asynchronously persists checkpoints to storage.
  • Flash Checkpoint recovers from failures using the in-memory checkpoint.
  • Flash Checkpoint supports DDP/FSDP/DeepSpeed/Megatron-LM.
  • Node detection supports NPU.

Examples

  • An example of training nanoGPT using DeepSpeed.
  • An example of saving/loading a sharded FSDP checkpoint.

Release 0.2.2

21 Nov 06:41

ElasticJob

Features:

  • dlrover-run can run any distributed job with NODE_RANK and DLROVER_MASTER_ADDR set in the environment.
  • DLRover can asynchronously save checkpoints to storage, blocking training for only a short time.

BugFix:

  • Fix the bug in loading the FSDP checkpoint.

Release 0.2.1

11 Oct 09:38

DLRover:

ElasticJob:

  • Autotuning batch size without restarting the job.
  • Automatically detect stragglers (slow workers).

TFPlus

TFPlus 0.1.0 has been released; see details at https://github.com/intelligent-machine-learning/dlrover/tree/master/tfplus

Kv Variable (Core Embedding Capability)

  • High-performance Embedding Ops
  • Kv Variable low-level APIs (4 in total; a hedged sketch follows this list)
    • tfplus.get_kv_variable
    • embedding_lookup
    • embedding_lookup_sparse
    • safe_embedding_lookup_sparse
  • Dynamic expansion and partitioning of Embedding weights
  • Support for both single-machine training and PS/Worker cluster training
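
Below is a hedged sketch of the low-level Kv Variable calls listed above. Only the function names come from these notes; every argument shown is an assumption to check against the tfplus documentation.

```python
# Hypothetical usage of the tfplus Kv Variable APIs named above. The
# function names come from the release notes; argument names and
# shapes are assumptions, so consult the tfplus docs before use.
import tensorflow as tf
import tfplus

# A dynamically growing embedding table keyed by int64 ids.
emb = tfplus.get_kv_variable("user_emb", embedding_dim=16)  # assumed kwargs

ids = tf.constant([3, 7, 3], dtype=tf.int64)
dense = tfplus.embedding_lookup(emb, ids)  # assumed signature; shape [3, 16]

# tfplus.embedding_lookup_sparse / tfplus.safe_embedding_lookup_sparse
# accept tf.SparseTensor ids, mirroring their tf.nn counterparts
# (assumed behavior).
```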

High-performance Optimizers

  • Common optimizers compatible with Kv Variable
    • Adam
    • Adagrad
  • In-house deep learning optimizers based on Sparse Group Lasso
    • Group Adam
    • Group Adagrad