29 Sep 02:02

BalaBalaYi

bdc5ed2

Release 0.3.8 Latest

Features:

Added a basic first-version implementation of positive diagnostics.
Supported a 'fast-fail' strategy for training jobs in some boundary scenarios, e.g. the pending case.
Accelerated pod creation by switching from synchronous to asynchronous creation.
Added a basic implementation of structured event logging.

BugFix:

Fixed unexpected rendezvous failure in occasional fault-tolerant scenarios.
Fixed unexpected socket client creation before socket creation.
Optimized 'network-check' implementation for 'Ascend NPU'.
Optimized some implementations for master-fault-tolerance(internal) scenario.
Other numerous known issues fixed and optimized.

13 May 06:02

workingloong

v0.3.7

2160cdc

Release 0.3.7

Features:

Flash Checkpoint supports deleting old checkpoints.

BugFix:

Save and load the non-parameter variables of the distributed optimizer in Megatron-LM models.
The agent waits for asynchronous checkpoint saving to finish before exiting.

24 Apr 06:18

workingloong

v0.3.6

c0134ec

Release 0.3.6

Features:

Flash Checkpoint provides FlashCkptTrainer to support HuggingFace transformers.Trainer.
Flash Checkpoint supports loading the checkpoint of Megatron-LM from memory.
Flash Checkpoint supports saving and loading FSDP checkpoints with the full state dict.
Job master can sort the node ranks by the access switches of the node.
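
As an illustration of sorting node ranks by access switch, here is a minimal sketch. It assumes the goal is to give nodes under the same switch contiguous ranks; the function name and the node-to-switch mapping are hypothetical and are not DLRover's internal API.

```python
from collections import defaultdict


def sort_ranks_by_switch(node_switches):
    """Assign contiguous ranks to nodes that share an access switch.

    node_switches maps a node name to the name of its access switch.
    Both naming schemes are hypothetical, chosen only for this sketch.
    """
    groups = defaultdict(list)
    for node, switch in node_switches.items():
        groups[switch].append(node)

    ranks = {}
    # Visit switches and nodes in sorted order so the result is deterministic.
    for switch in sorted(groups):
        for node in sorted(groups[switch]):
            ranks[node] = len(ranks)
    return ranks
```

With this ordering, nodes that sit under the same switch receive adjacent ranks, which is one plausible way to keep neighboring ranks close in the network topology.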

BugFix:

Fix the segmentation fault when restarting the training process.

29 Mar 07:02

workingloong

v0.3.5

af5fdbc

Release 0.3.5

Features:

Flash checkpoint supports saving and loading Megatron-LM MOE models. #1042
Added APIs to extend the node-check module for nodes with different chips. #1023
Automatically mark the node as unschedulable if the node fails. #1025

BugFix:

Fix the DDP example of mnist to save and load checkpoint. #1051
Fix the checkpoint name of DDP. #1034

21 Feb 07:10

workingloong

v0.3.4

185d871

Release 0.3.4

Features:

Flash Checkpoint can save and load Megatron-LM models from multiple ranks in parallel.
dlrover-run --auto-config automatically configures the number of nodes and the number of processes per node.
Users can customize the storage APIs to save checkpoints to different file systems.
Added a deletion strategy to clean up old checkpoint files.
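
A deletion strategy for old checkpoints can be sketched as a "keep the newest N" policy. The sketch below is an assumption for illustration only: the `checkpoint-<step>` directory layout, the function name, and the `max_to_keep` parameter are hypothetical and are not taken from DLRover.

```python
import os
import re
import shutil


def clean_old_checkpoints(root, max_to_keep):
    """Delete all but the newest `max_to_keep` step directories under root.

    Assumes a hypothetical layout of root/checkpoint-<step>/ directories.
    Returns the list of deleted steps in ascending order.
    """
    steps = []
    for name in os.listdir(root):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match:
            steps.append(int(match.group(1)))
    steps.sort()

    # Everything except the last `max_to_keep` steps is stale.
    stale = steps[:-max_to_keep] if max_to_keep > 0 else steps
    for step in stale:
        shutil.rmtree(os.path.join(root, f"checkpoint-{step}"))
    return stale
```

Sorting by the numeric step rather than by file modification time keeps the policy stable even when an asynchronous writer finishes checkpoints out of order.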

BugFix:

Fix the error that the shared memory does not exist if the size of the checkpoint changes.

25 Jan 02:28

workingloong

v0.3.3

654240d

Release 0.3.3

Features:

Support Python > 3.10.
Support restarting the training process on Ascend NPU.
Support asynchronously saving the checkpoint of the distributed optimizer of Megatron-LM to the storage.

BugFix:

Fix the checkpoint shard inconsistency of all ranks.
Fix the bug in asynchronously saving the Megatron-LM checkpoint of multi-node, multi-GPU jobs.
Fix the bug in loading the Megatron-LM checkpoint.

10 Jan 01:54

workingloong

v0.3.1

222edf7

Release 0.3.1

Feature:

Users can use Flash Checkpoint in jobs launched with torchrun or python -m torch.distributed.launch.

Bugfix:

Fix the issue that the dlrover master cannot print the error message of the fault node in a kubeflow/PytorchJob.

03 Jan 06:54

workingloong

v0.3.0

ce88437

Release 0.3.0

Features:

Flash Checkpoint asynchronously persists checkpoints to storage.
Flash Checkpoint recovers from failures using the checkpoint in memory.
Flash Checkpoint supports DDP/FSDP/DeepSpeed/Megatron.
Node detection supports NPU.

Examples

An example of training nanoGPT using DeepSpeed.
An example of saving and loading sharded FSDP checkpoints.

21 Nov 06:41

workingloong

v0.2.2

8736094

Release 0.2.2

ElasticJob

Features:

dlrover-run can run in any distributed job that sets NODE_RANK and DLROVER_MASTER_ADDR in the environment.
DLRover can asynchronously save the checkpoint to storage, which blocks training only briefly.
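
The idea behind asynchronous checkpoint saving can be sketched as follows: training blocks only while the state is copied in memory, and a background thread writes the copy to storage. This is a generic illustration using the Python standard library, not DLRover's implementation; the class name and methods are hypothetical.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointSaver:
    """Snapshot state in memory, then write it to disk in a background thread."""

    def __init__(self):
        self._thread = None

    def save(self, state, path):
        self.wait()  # Allow only one write in flight at a time.
        snapshot = copy.deepcopy(state)  # The only step that blocks training.
        self._thread = threading.Thread(
            target=self._persist, args=(snapshot, path)
        )
        self._thread.start()

    @staticmethod
    def _persist(snapshot, path):
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        # Rename into place so readers never see a partially written file.
        os.replace(tmp_path, path)

    def wait(self):
        """Block until the pending write, if any, has finished."""
        if self._thread is not None:
            self._thread.join()
            self._thread = None
```

Because the write runs in the background, the caller should invoke `wait()` before the process exits; otherwise the last checkpoint may never reach storage.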

BugFix:

Fix the bug in loading the FSDP checkpoint.

11 Oct 09:38

workingloong

v0.2.1

48fa032

Release 0.2.1

DLRover:

ElasticJob:

Autotuning batch size without restarting the job.
Automatically detect the straggler (slow worker).
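
One simple way to flag a straggler is to compare each worker's step time with the median across workers. The sketch below assumes that approach for illustration; the function name, the input mapping, and the threshold of 2.0 are hypothetical and do not describe DLRover's actual detection logic.

```python
from statistics import median


def find_stragglers(step_times, threshold=2.0):
    """Return workers whose step time exceeds `threshold` times the median.

    step_times maps a worker name to its recent average step time in seconds.
    """
    if not step_times:
        return []
    typical = median(step_times.values())
    return sorted(
        worker for worker, t in step_times.items() if t > threshold * typical
    )
```

The median is used instead of the mean so that a single very slow worker does not inflate the baseline it is compared against.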

TFPlus

TFPlus 0.1.0 has been released; see details at https://github.com/intelligent-machine-learning/dlrover/tree/master/tfplus

Kv Variable (Core Embedding Capability)

High-performance Embedding Ops
Kv Variable low-level APIs (4 in total)
- tfplus.get_kv_variable
- embedding_lookup
- embedding_lookup_sparse
- safe_embedding_lookup_sparse
Dynamic expansion and partitioning of Embedding weights
Support for both single-machine training and PS/Worker cluster training

High-performance Optimizers

Common optimizers compatible with Kv Variable
- Adam
- Adagrad
In-house deep learning optimizers based on Sparse Group Lasso
- Group Adam
- Group Adagrad


Releases: intelligent-machine-learning/dlrover
