Skip to content

Commit

Permalink
Merge pull request #233 from SymbioticLab/oobleck-update
Browse files Browse the repository at this point in the history
Update oobleck paper
  • Loading branch information
mosharaf authored Nov 8, 2023
2 parents 7502b1c + d44f13a commit d6c31fe
Show file tree
Hide file tree
Showing 2 changed files with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions source/_data/SymbioticLab.bib
Original file line number Diff line number Diff line change
Expand Up @@ -1545,7 +1545,7 @@ @Article{oobleck:arxiv23
publist_link = {paper || https://arxiv.org/abs/2309.08125},
publist_topic = {Systems + AI},
publist_abstract = {
Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least f+1 logically equivalent pipeline replicas to tolerate any f simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after f or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to 13.9x.
Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least f+1 logically equivalent pipeline replicas to tolerate any f simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after f or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to 29.6x.
}
}

Expand All @@ -1563,7 +1563,7 @@ @InProceedings{oobleck:sosp23
publist_badge = {Artifacts Functional},
publist_badge = {Results Reproduced},
publist_abstract = {
Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least f+1 logically equivalent pipeline replicas to tolerate any f simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after f or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to 13.9x.
Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least f+1 logically equivalent pipeline replicas to tolerate any f simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after f or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to 29.6x.
}}


Expand Down
Binary file modified source/publications/files/oobleck:sosp23/oobleck-sosp23.pdf
Binary file not shown.

0 comments on commit d6c31fe

Please sign in to comment.