Now, as a follow-up of #1630: a very nice next step/feature would be if we could use this sharding feature in general for any kind of multi-GPU training, similar to the `horovod_dataset_distribution="shard"` option we had for TF (which was implemented very inefficiently though, by just selecting every Nth seq from the dataset, i.e. the dataset still iterated through all the data). So maybe, to distinguish this, or to make it more explicit where the sharding is done, we should not just call it `"shard"`, but `"dataset_sharding"` or so.
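To illustrate the difference (this is just a minimal toy sketch, not RETURNN code; seqs are plain integers here and the function names are made up):

```python
# Toy sketch only: the point is where the iteration happens, not the data itself.

NUM_SEQS = 10

def iterate_all_seqs():
    """Stand-in for a dataset that reads every seq."""
    for seq_idx in range(NUM_SEQS):
        yield seq_idx  # in reality this would load the actual seq data

def shard_by_filtering(shard_index, num_shards):
    """Old TF "shard" behavior: every worker still iterates through
    all the data and just drops seqs not belonging to its shard."""
    for seq_idx, seq in enumerate(iterate_all_seqs()):
        if seq_idx % num_shards == shard_index:
            yield seq

def shard_in_dataset(shard_index, num_shards):
    """Dataset-level sharding: each worker only ever touches
    the seq indices of its own shard."""
    for seq_idx in range(shard_index, NUM_SEQS, num_shards):
        yield seq_idx  # only these seqs would be loaded at all

print(list(shard_by_filtering(shard_index=1, num_shards=4)))  # [1, 5, 9]
print(list(shard_in_dataset(shard_index=1, num_shards=4)))    # [1, 5, 9]
```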
We should also not reuse the `horovod_dataset_distribution` option (which is intended only for Horovod), but maybe use a generic `distributed_dataset_distribution` or so? Or it could be part of `torch_distributed`, just `dataset_distribution` within it? (In principle, we could later reuse the feature also for TF or other backend engines. But having it in `torch_distributed` for now is also fine.)
The `dataset_distribution` default would be `"random_seed_offset"` (i.e. like `horovod_dataset_distribution="random_seed_offset"`), which is the current behavior of PyTorch distributed training. (We could change the default via a new behavior version if we want to...)
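For concreteness, a config could then look roughly like this (a hypothetical sketch; the option name and its placement inside `torch_distributed` are exactly what is being discussed here, so none of this is final):

```python
# Hypothetical RETURNN config sketch, assuming the option ends up
# inside torch_distributed as suggested above. Names are not final.

backend = "torch"

torch_distributed = {
    # "random_seed_offset": current default, every worker sees the whole
    #   dataset but with a different shuffling seed.
    # "shard" / "dataset_sharding" (name to be decided): each worker only
    #   iterates over its own shard of the dataset.
    "dataset_distribution": "shard",
}
```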
(Also note, similar to a comment I made in #1612: there are some implicit assumptions here, namely that the worker index and rank are static. This might not always be the case. But it might be possible to just update the shard index / num shards dynamically for the next sub-epoch. Just to keep this in mind; I don't think we need to take care of this now.)
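If the rank were ever to change, one could in principle just re-derive the shard settings at the start of each sub-epoch, roughly like this (again only a hypothetical sketch, nothing RETURNN-specific):

```python
import torch.distributed as dist

def get_sharding_for_next_sub_epoch():
    """Hypothetical sketch: re-query rank/world size per sub-epoch
    instead of fixing shard_index/num_shards once at startup."""
    if dist.is_available() and dist.is_initialized():
        return {"shard_index": dist.get_rank(), "num_shards": dist.get_world_size()}
    return {"shard_index": 0, "num_shards": 1}  # single-process fallback
```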