Now, as a follow-up of #1630: a very nice next step/feature would be if we could use this sharding feature in general for any kind of multi-GPU training, similar to the `horovod_dataset_distribution="shard"` option we had for TF (which was implemented very inefficiently though, by just selecting every Nth seq from the dataset, i.e. the dataset still iterated through all the data). So maybe, to distinguish this, or to make it more explicit where the sharding is done, we should not just call it `"shard"`, but `"dataset_sharding"` or so.
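To illustrate the difference (this is just a minimal toy sketch, not RETURNN code; seqs are plain integers here and the function names are made up):

```python
# Toy sketch only: the point is where the iteration happens, not the data itself.

NUM_SEQS = 10

def iterate_all_seqs():
    """Stand-in for a dataset that reads every seq."""
    for seq_idx in range(NUM_SEQS):
        yield seq_idx  # in reality this would load the actual seq data

def shard_by_filtering(shard_index, num_shards):
    """Old TF "shard" behavior: every worker still iterates through
    all the data and just drops seqs not belonging to its shard."""
    for seq_idx, seq in enumerate(iterate_all_seqs()):
        if seq_idx % num_shards == shard_index:
            yield seq

def shard_in_dataset(shard_index, num_shards):
    """Dataset-level sharding: each worker only ever touches
    the seq indices of its own shard."""
    for seq_idx in range(shard_index, NUM_SEQS, num_shards):
        yield seq_idx  # only these seqs would be loaded at all

print(list(shard_by_filtering(shard_index=1, num_shards=4)))  # [1, 5, 9]
print(list(shard_in_dataset(shard_index=1, num_shards=4)))    # [1, 5, 9]
```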
We should also not reuse the `horovod_dataset_distribution` option (which is intended only for Horovod), but maybe use a generic `distributed_dataset_distribution` or so? Or it could be part of `torch_distributed`, just `dataset_distribution` within it? (In principle, we could later reuse the feature also for TF or other backend engines. But having it in `torch_distributed` for now is also fine.)
The `dataset_distribution` default would be `"random_seed_offset"` (i.e. like `horovod_dataset_distribution="random_seed_offset"`), which is the current behavior of PyTorch distributed training. (We could change the default via a new behavior version if we want to...)
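For concreteness, a config could then look roughly like this (a hypothetical sketch; the option name and its placement inside `torch_distributed` are exactly what is being discussed here, so none of this is final):

```python
# Hypothetical RETURNN config sketch, assuming the option ends up
# inside torch_distributed as suggested above. Names are not final.

backend = "torch"

torch_distributed = {
    # "random_seed_offset": current default, every worker sees the whole
    #   dataset but with a different shuffling seed.
    # "shard" / "dataset_sharding" (name to be decided): each worker only
    #   iterates over its own shard of the dataset.
    "dataset_distribution": "shard",
}
```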
(Also note, similar to a comment I made in #1612: there are some implicit assumptions here, namely that the worker index and rank are static. This might not always be the case. But it might be possible to just update the shard index / num shards dynamically for the next sub-epoch. Just to keep this in mind; I don't think we need to take care of this now.)
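If the rank were ever to change, one could in principle just re-derive the shard settings at the start of each sub-epoch, roughly like this (again only a hypothetical sketch, nothing RETURNN-specific):

```python
import torch.distributed as dist

def get_sharding_for_next_sub_epoch():
    """Hypothetical sketch: re-query rank/world size per sub-epoch
    instead of fixing shard_index/num_shards once at startup."""
    if dist.is_available() and dist.is_initialized():
        return {"shard_index": dist.get_rank(), "num_shards": dist.get_world_size()}
    return {"shard_index": 0, "num_shards": 1}  # single-process fallback
```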