I think it's not too bad to implement pipeline parallelism directly in `Stacked`. The basic idea is that we map the `Layers` axis of a `Stacked` to a (new) physical axis (called `stage` here and in the link), then we reshape our batch into microbatches and push them through the pipeline.

Example implementation: https://github.com/tensorflow/lingvo/blob/master/lingvo/jax/layers/pipeline.py (which looks a lot like `accumulate_gradients_sharded`)
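To make the idea concrete, here's a minimal sketch of the microbatch schedule in plain JAX (no `Stacked`/Haliax, no actual device sharding; `pipeline_forward`, `layer_fn`, and the single-linear-layer stage are illustrative assumptions, not anything from the issue). Layer weights are stacked along a leading stage axis, the batch is reshaped into microbatches, and a shift buffer pushes each microbatch through the stages one step at a time, GPipe-style:

```python
import jax
import jax.numpy as jnp

def pipeline_forward(stacked_w, batch, num_microbatches):
    """Hypothetical GPipe-style forward pass.

    stacked_w: (num_stages, d, d) -- per-stage layer weights, stacked
               along the "stage" axis the issue describes.
    batch:     (batch_size, d)    -- the macro batch.
    """
    num_stages, d, _ = stacked_w.shape
    micro = batch.reshape(num_microbatches, -1, d)      # (M, micro_bs, d)
    # Pad the input stream so the pipeline can drain.
    pad = jnp.zeros((num_stages - 1, *micro.shape[1:]))
    feed = jnp.concatenate([micro, pad])                # (M + S - 1, mb, d)

    def layer_fn(w, x):
        # One pipeline stage: a single linear layer + nonlinearity
        # (stand-in for a real transformer block).
        return jnp.tanh(x @ w)

    # vmap over the stage axis: every stage processes its current slot
    # of the shift buffer "in parallel".
    stage_step = jax.vmap(layer_fn)

    def step(carry, inp):
        # Shift: the new microbatch enters stage 0; stage s receives
        # stage s-1's output from the previous step.
        buf = jnp.concatenate([inp[None], carry[:-1]])
        y = stage_step(stacked_w, buf)
        return y, y[-1]                                 # emit last stage

    init = jnp.zeros((num_stages, micro.shape[1], d))
    _, outs = jax.lax.scan(step, init, feed)
    # The first num_stages - 1 emissions are the pipeline "bubble";
    # the real outputs follow, one microbatch per step.
    return outs[num_stages - 1:].reshape(batch.shape)
```

The result matches running the stacked layers sequentially over the whole batch; the point of the schedule is only that, once `stage` is a physical mesh axis, the per-step `vmap` body runs on different devices concurrently.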
The biggest thing that's not clear to me is how to partition the (macro) batch itself. The easiest thing to do is to replicate it across the `stage` axis, but I don't think that's ideal; we should take a look at an existing implementation of pipeline parallelism.
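For reference, the "replicate across the stage axis" option could look something like this with plain `jax.sharding` (mesh axis names `stage`/`data` and the weight/batch layouts are my assumptions, not a decided design):

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange whatever devices are available into a (stage, data) grid.
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("stage", "data"))

# Stacked layer weights: split along the new "stage" physical axis,
# so each stage group holds only its own layers.
w_sharding = NamedSharding(mesh, P("stage", None, None))

# Macro batch: replicated over "stage" (the easy option discussed
# above), sharded over "data".
batch_sharding = NamedSharding(mesh, P("data", None))

batch = jax.device_put(jnp.zeros((8, 4)), batch_sharding)
```

Replication means every stage group holds the full batch even though it only consumes the microbatches passing through it, which is the memory cost that makes this option feel non-ideal.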