Merge pull request #889 from workingloong/polish-readme
Add Atorch link and llama2 example in the README.
samplise authored Dec 15, 2023
2 parents ac9651b + d4c1f5f commit 3385516
Showing 1 changed file with 8 additions and 1 deletion.
README.md
@@ -29,6 +29,8 @@ to improve the training performance and resources utilization.

## Latest News

- [2023/11] [ATorch, which supports efficient and easy-to-use model training, is released.](atorch/README.md)
- [2023/10] [AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference as Preconditioning Matrix, NeurIPS 2023.](atorch/docs/README-AGD.md)
- [2023/09] [Weighted Sharpness-Aware Minimization (WSAM) has been accepted by KDD'23.](atorch/docs/README-WSAM.md)
- [2023/08] [DLRover improves the stability of pre-trained model training over thousands of GPUs.](docs/blogs/stabilize_llm_training_cn.md)
- [2023/04] [DLRover auto-scales nodes of a DeepRec distributed training job.](docs/blogs/deeprec_autoscale_cn.md)
@@ -74,7 +76,7 @@ can reduce the overhead to restore the training.
| Step to restore training | Failure without DLRover | PS failure with DLRover | Worker failure with DLRover |
|:-----------------------------------------------:|:-------------------------:|:--------------------------:|:---------------------------:|
| Restore action | Restart Job | Restart failed PS | Restart failed workers |
| Schedule node, pull image and install packages | All nodes | Only new PS | Only new workers |
| Start session | All nodes | All nodes | Only new workers |
| Initialize Graph | Yes | Yes | Only new workers |
| Restore checkpoint | Yes | Yes | No |
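To make the comparison concrete, here is a toy Python sketch of the decision logic the table summarizes. `Failure` and `recovery_plan` are illustrative names for this sketch only, not part of DLRover's API:

```python
from enum import Enum

class Failure(Enum):
    PS = "parameter server"
    WORKER = "worker"

def recovery_plan(failure: Failure) -> dict:
    """Toy mirror of the table above; not DLRover's real implementation."""
    if failure is Failure.PS:
        # A failed PS loses its shard of the parameters, so every node must
        # restart its session, re-initialize the graph, and the job reloads
        # the last checkpoint.
        return {"restart": "failed PS only", "start_session": "all nodes",
                "initialize_graph": "all nodes", "restore_checkpoint": True}
    # The surviving PS nodes still hold the parameters in memory, so only the
    # replacement workers start a session and build the graph; no restore.
    return {"restart": "failed workers only", "start_session": "new workers",
            "initialize_graph": "new workers", "restore_checkpoint": False}
```

Worker recovery is the cheap path: because the parameters live on the surviving PS nodes, new workers rejoin the job without any checkpoint restore.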
@@ -193,6 +195,9 @@ a model with DLRover.

## What's Next?

- Asynchronously save the checkpoint to the storage (see the sketch after this list).
  - Significantly reduce the checkpoint save/restore time that blocks training.
  - Save the checkpoint from CPU memory even if the training process fails.
- Fine-grained automatic distributed training for GPU synchronous jobs
  - hybrid-parallel mode
  - adaptive hyperparameter adjustment with dynamic resources
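As a rough illustration of the asynchronous-checkpoint item above, here is a minimal PyTorch sketch assuming a plain `torch.save` backend; `async_save_checkpoint` is a hypothetical helper, not DLRover's actual API. The training loop blocks only for the GPU-to-CPU copy, while a background thread persists the CPU copy to storage:

```python
import threading

import torch

def async_save_checkpoint(model, step, path):
    # Snapshot the weights into CPU memory; only this copy blocks training.
    cpu_state = {
        "step": step,
        "model": {k: v.detach().cpu() for k, v in model.state_dict().items()},
    }
    # Persist the CPU copy to storage in a background thread so the next
    # training step can start immediately.
    writer = threading.Thread(target=torch.save, args=(cpu_state, path))
    writer.start()
    return writer  # join() before exiting to guarantee the file is complete
```

The direction listed above goes further than this in-process thread: keeping the snapshot in CPU memory (e.g. shared memory owned by a separate agent) would let it be persisted even if the training process itself fails after the copy completes.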
@@ -212,3 +217,5 @@ Please refer to the [DEVELOPMENT](docs/developer_guide.md)
[Train a PyTorch Model on Kubernetes](docs/tutorial/torch_on_cloud.md)

[Train a GPT Model on Kubernetes](docs/tutorial/torch_ddp_nanogpt.md)

[Train a Llama2 model](examples/pytorch/llama2/README.md)
