ML system papers targeting efficient training on heterogeneous clusters (clusters with different types of devices) are less studied than those targeting homogeneous clusters (clusters with the same type of device). However, there is growing interest in this area. The motivations for using heterogeneous clusters in distributed training are:
- For data centers, the use of heterogeneous GPUs is inevitable due to the short release cycle of new GPU architectures.
- For users, purchasing a mix of available, cheap heterogeneous spot instances reduces both expense and the cost of failures: when one type of device is lost to out-bidding (the bid price falls below the spot price), training can still continue on the other device types.
We have categorized the challenges brought by heterogeneous devices and the corresponding solutions (papers) in the following sections. If you have any papers to add, feel free to ping me ([email protected]).
Papers targeting inter-pipeline heterogeneity (each pipeline contains homogeneous devices, while different pipelines use heterogeneous devices):
Main problem to solve: inter-pipeline heterogeneity leads to load imbalance.
Papers using batch distribution to balance the workload among pipelines
- Jang, Insu, et al. "Oobleck: Resilient distributed training of large models using pipeline templates." Proceedings of the 29th Symposium on Operating Systems Principles. 2023. (Citations: 14) - Although this paper does not target heterogeneous clusters, Section 4.2.2 formulates static optimal batch distribution across heterogeneous pipelines that each consist of homogeneous devices (e.g. pipeline A has 3 nodes, pipeline B has 4 nodes); a toy version of this idea is sketched after this list
- Jia, Xianyan, et al. "Whale: Efficient giant model training over heterogeneous GPUs." 2022 USENIX Annual Technical Conference (USENIX ATC 22). 2022. (Citations: 33) - See Section 3.3.1 for dynamic workload (mini-batch) shifting
- Li, Dacheng, et al. "Amp: Automatically finding model parallel strategies with heterogeneity awareness." Advances in Neural Information Processing Systems 35 (2022): 6630-6639. (Citations: 7) - See Section 3.6 on statically enumerating mini-batch sizes for each pipeline
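As a rough illustration of the static batch-distribution idea (not the exact formulation in Oobleck or AMP), the sketch below splits a global batch across pipelines in proportion to each pipeline's measured throughput. The pipeline names and throughput numbers are made up.

```python
# Toy sketch: split a global batch across heterogeneous pipelines in
# proportion to measured per-pipeline throughput (samples/sec).
# Pipeline names and throughput values are illustrative only.

def distribute_batch(global_batch_size, throughputs):
    """Return {pipeline: samples per iteration}, roughly proportional to throughput."""
    total = sum(throughputs.values())
    # Floored proportional assignment.
    alloc = {p: int(global_batch_size * t / total) for p, t in throughputs.items()}
    # Hand the remaining samples to the fastest pipelines so the
    # allocation sums exactly to the global batch size.
    leftover = global_batch_size - sum(alloc.values())
    for p in sorted(throughputs, key=throughputs.get, reverse=True)[:leftover]:
        alloc[p] += 1
    return alloc

if __name__ == "__main__":
    # e.g. pipeline A built from 3 slower nodes, pipeline B from 4 faster nodes
    throughputs = {"pipeline_A": 120.0, "pipeline_B": 200.0}  # samples/sec (made up)
    print(distribute_batch(512, throughputs))
    # -> {'pipeline_A': 192, 'pipeline_B': 320}
```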
Papers using decentralized synchronization to improve overall throughput (a generic sketch of loosely synchronized training follows this list)
- Yuan, Binhang, et al. "Decentralized training of foundation models in heterogeneous environments." Advances in Neural Information Processing Systems 35 (2022): 25464-25477. (Citations: 55)
- Park, Jay H., et al. "HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism." 2020 USENIX Annual Technical Conference (USENIX ATC 20). 2020. (Citations: 122)
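Neither of the two papers above reduces to the following, but as a minimal illustration of why loosening synchronization helps when workers run at different speeds, here is a sketch of periodic parameter averaging (local SGD): each worker advances independently for a fixed number of local steps before the models are averaged, so fast devices are not forced into per-step lockstep with slow ones. All functions and numbers below are illustrative.

```python
# Minimal sketch (not the algorithms of the papers above): periodic parameter
# averaging as an illustration of loosely synchronized data parallelism.
import numpy as np

def local_steps(weights, grad_fn, lr, num_steps):
    """Run num_steps of plain SGD locally on one worker."""
    w = weights.copy()
    for _ in range(num_steps):
        w -= lr * grad_fn(w)
    return w

def train_round(global_weights, workers, lr=0.1, sync_period=8):
    """One synchronization round: every worker advances independently,
    then all local models are averaged into the new global model."""
    local_models = [local_steps(global_weights, grad_fn, lr, sync_period)
                    for grad_fn in workers]
    return np.mean(local_models, axis=0)

if __name__ == "__main__":
    # Two "workers" optimizing a toy quadratic; their gradients differ
    # slightly to mimic different data shards.
    workers = [lambda w: 2 * (w - 1.0), lambda w: 2 * (w - 1.2)]
    w = np.zeros(4)
    for _ in range(10):
        w = train_round(w, workers)
    print(w)  # converges near the average optimum, ~1.1
```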
Main problem to solve: within a pipeline, the optimal layer-assignment problem on heterogeneous devices is NP-hard with respect to the number of device types (a toy sketch of a simplified variant follows the paper list below).
- Park, Jay H., et al. "HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism." 2020 USENIX Annual Technical Conference (USENIX ATC 20). 2020. (Citations: 122) - See Section 7 for HetPipe's intra-pipeline layer partitioning algorithm (ILP)
- Liu, Ji, et al. "Heterps: Distributed deep learning with reinforcement learning based scheduling in heterogeneous environments." Future Generation Computer Systems 148 (2023): 106-117. (Citations: 31) - This paper uses reinforcement learning to select a device for every layer
- Xue, Chunyu, et al. "A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters." arXiv preprint arXiv:2403.16125 (2024). (Citations: 0) - Scheduler and Parallelization Codesign
- Xu, Si, et al. "HetHub: A Heterogeneous distributed hybrid training system for large-scale models." arXiv preprint arXiv:2405.16256 (2024). (Citations: 0) - 3D parallelism strategies on heterogeneous clusters
- Mei, Yixuan, et al. "Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs." arXiv preprint arXiv:2406.01566 (2024). (Citations: 0) - From CMU; targets serving rather than training, formulating placement on heterogeneous GPUs as a max-flow problem
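To make the partitioning problem concrete, here is a toy dynamic program for one heavily simplified variant: devices kept in a fixed order, contiguous layer blocks, per-layer costs scaled by a per-device speed factor, and no memory or communication constraints. This is not HetPipe's ILP (or any of the above papers' formulations), and all numbers in the example are made up; real systems must additionally respect per-device memory limits and inter-stage communication, which is part of what makes the general problem hard.

```python
# Toy sketch of a simplified intra-pipeline partition problem: given per-layer
# compute costs, per-device speed factors, and a FIXED device order, assign
# contiguous blocks of layers to devices so the slowest stage (the pipeline
# bottleneck) is as fast as possible. Memory and communication are ignored.
from functools import lru_cache

def partition_layers(layer_costs, device_speeds):
    """Return (bottleneck_time, stage end indices) via dynamic programming."""
    n, d = len(layer_costs), len(device_speeds)
    prefix = [0.0]
    for c in layer_costs:
        prefix.append(prefix[-1] + c)

    def block_cost(i, j, dev):                # time of layers [i, j) on device dev
        return (prefix[j] - prefix[i]) / device_speeds[dev]

    @lru_cache(maxsize=None)
    def best(i, dev):
        """Min achievable bottleneck for layers [i, n) on devices [dev, d)."""
        if dev == d - 1:                      # last device takes all remaining layers
            return block_cost(i, n, dev), (n,)
        best_val, best_cut = float("inf"), None
        for j in range(i + 1, n - (d - dev - 1) + 1):  # leave >=1 layer per later device
            rest_val, rest_cut = best(j, dev + 1)
            val = max(block_cost(i, j, dev), rest_val)
            if val < best_val:
                best_val, best_cut = val, (j,) + rest_cut
        return best_val, best_cut

    return best(0, 0)

if __name__ == "__main__":
    layer_costs = [4, 4, 2, 2, 1, 1, 1, 1]    # made-up per-layer compute times
    device_speeds = [2.0, 1.0, 1.0]           # e.g. one fast GPU and two slow GPUs
    print(partition_layers(layer_costs, device_speeds))
    # -> (4.0, (2, 4, 8)): the fast GPU gets layers 0-1, each slow GPU gets a 4-unit block
```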