Open
Description
I have not traced the exact math, but get_partitions
raises an error when num_canonical_nodes / num_physical_nodes
is not a whole number. This could be resolved by making the default conditional on the number of physical nodes, i.e.:
pn=num_physical_nodes
num_canonical_nodes = num_canonical_nodes or 120 // pn * pn + pn
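A minimal sketch of that conditional default as a standalone helper, to make the arithmetic easier to check. The function name `resolve_canonical_nodes` and the `target` parameter are hypothetical and not part of the library; the sketch only mirrors the expression above and leaves an explicitly passed value untouched.

```python
from typing import Optional


def resolve_canonical_nodes(
    num_physical_nodes: int,
    num_canonical_nodes: Optional[int] = None,
    target: int = 120,  # hypothetical knob; 120 is the constant from the snippet above
) -> int:
    """Return a canonical node count, defaulting to a multiple of the physical node count."""
    if num_canonical_nodes:
        # An explicit value is passed through unchanged (it may still be indivisible).
        return num_canonical_nodes
    pn = num_physical_nodes
    # Round target down to a multiple of pn, then add one more pn.
    return target // pn * pn + pn


if __name__ == "__main__":
    for pn in (1, 4, 6, 8):
        print(pn, resolve_canonical_nodes(pn))
```

Because only the default is adjusted, a caller who passes num_canonical_nodes=128 with 6 physical nodes would still hit the error; validating explicit values would be a separate change.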
Example I hit when attempting to train the 350M GPT example on 6 nodes:
get_partitions(
num_samples=364672,
num_canonical_nodes=128,
num_physical_nodes=6,
ranks_per_node=4,
workers_per_rank=1,
batch_size=6
)
# => ValueError: cannot reshape array of size 364672 into shape (6)
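As a quick sanity check of the divisibility hypothesis (not a confirmed diagnosis of the reshape error), the values from this call can be plugged in directly. The variable names `remainder`, `lower`, and `upper` are illustrative only.

```python
num_canonical_nodes = 128
num_physical_nodes = 6

# Non-zero remainder means the canonical nodes do not split evenly across physical nodes.
remainder = num_canonical_nodes % num_physical_nodes

# Nearest multiples of num_physical_nodes below and above the requested value.
lower = num_canonical_nodes - remainder
upper = lower + num_physical_nodes

print(f"remainder={remainder}, nearest multiples: {lower} and {upper}")
```

With these values, 128 leaves a remainder of 2, and the nearest multiples of 6 are 126 and 132.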