New DataPartitionType DATA #567

Open · wants to merge 1 commit into main

Conversation

@apoorvtintin commented Jul 1, 2024

Increases memory efficiency during large-scale training: input batches and labels are sharded along the 'data' axis.
Adds a new input data sharding option, DataPartitionType.DATA.

# Data are fully replicated across all devices.
REPLICATED = "replicated"
# Data are partitioned across the data axis and replicated along other mesh axes.
DATA = "data"
Contributor


A high-level question: what is the purpose of this change?
I see that we already have FULL partition support, which partitions on axis=0 (the data axis). How is DATA different from FULL?


DATA replicates over the sequence dimension, so the spec is ("data", None) versus ("data", "model") for FULL.
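To make the difference concrete, here is a small, hypothetical sketch (plain Python, not AXLearn or JAX code) that computes the per-device shard shape of a (batch, seq_len) input under each spec. The mesh sizes and shapes are illustrative assumptions; the `spec` tuples mirror `jax.sharding.PartitionSpec` semantics, where each entry names the mesh axis a dimension is split over, or None for replication.

```python
def shard_shape(global_shape, spec, mesh):
    """Per-device shard shape for an array partitioned by `spec`.

    `spec` maps each array dimension to a mesh axis name (split across
    that axis) or None (replicated on every device along that axis).
    """
    return tuple(
        dim if axis is None else dim // mesh[axis]
        for dim, axis in zip(global_shape, spec)
    )

# Illustrative 2-D mesh: 4-way data parallel, 8-way tensor parallel.
mesh = {"data": 4, "model": 8}
global_batch = (32, 2048)  # (batch, seq_len)

# FULL: ("data", "model") -> sequence dim is split across TP workers.
print(shard_shape(global_batch, ("data", "model"), mesh))  # (8, 256)

# DATA: ("data", None) -> sequence dim is replicated on each TP worker.
print(shard_shape(global_batch, ("data", None), mesh))     # (8, 2048)
```

Under DATA, each TP worker holds the full sequence, which is what allows sequence parallelism over TP workers and avoids the partitioner inserting collectives to regather the sequence dimension.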

@ruomingp (Contributor) left a comment

> Increases memory efficiency

Do you have measurements on how DATA improves memory efficiency? Thanks.

@ptoulme-aws

> Increases memory efficiency
>
> Do you have measurements on how DATA improves memory efficiency? Thanks.

By replicating along the sequence dimension over TP workers, we limit the collectives and dynamic-slices introduced by the SPMD partitioner. This lowers overall step time and also lets us run sequence parallelism over TP workers.

@ruomingp (Contributor) commented Jul 8, 2024


Thanks. Do you have quantitative measurements?

@ptoulme-aws


No, we do not. It is more that when we inspect the HLO after the SPMD partition pass, we see much more optimal sharding: fewer all-to-alls and fewer dynamic-slices on the right-hand side.
