Default num_canonical_nodes to an even multiple of num_physical_nodes #215

Open
@micimize

Description

I haven't tracked down the exact math that fails, but get_partitions errors out if num_canonical_nodes / num_physical_nodes is not a whole number. This could be resolved by making the default depend on the number of physical nodes, i.e.

pn = num_physical_nodes
# Keep an explicit value; otherwise default to the smallest multiple of pn above 120.
num_canonical_nodes = num_canonical_nodes or (120 // pn * pn + pn)
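
To make the proposed default concrete, here is a minimal sketch of the same arithmetic as a standalone function. The name `default_canonical_nodes` and its `floor` parameter are hypothetical, introduced only for illustration; they are not part of the library, and the value 120 is simply the constant from the snippet above.

```python
from typing import Optional


def default_canonical_nodes(
    num_physical_nodes: int,
    num_canonical_nodes: Optional[int] = None,
    floor: int = 120,  # hypothetical parameter; 120 comes from the proposal above
) -> int:
    """Return the caller's value if given, else a multiple of num_physical_nodes."""
    if num_physical_nodes <= 0:
        raise ValueError("num_physical_nodes must be positive")
    if num_canonical_nodes:
        # An explicit value is passed through unchanged, as in the proposal.
        return num_canonical_nodes
    # Largest multiple of num_physical_nodes that is <= floor, plus one more step.
    return floor // num_physical_nodes * num_physical_nodes + num_physical_nodes


if __name__ == "__main__":
    for pn in (1, 6, 7, 8):
        value = default_canonical_nodes(pn)
        print(pn, value, value % pn == 0)
```

For 6 physical nodes this yields 126, and for 8 it yields 128, so the default only changes for node counts that do not already divide 128.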

Error I saw when attempting to train a 350M GPT example on 6 nodes:

get_partitions(
    num_samples=364672,
    num_canonical_nodes=128,
    num_physical_nodes=6,
    ranks_per_node=4,
    workers_per_rank=1,
    batch_size=6
)
# => ValueError: cannot reshape array of size 364672 into shape (6)
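
In the call above, 128 is not a multiple of 6 (128 = 21 * 6 + 2). As a rough illustration of which values would satisfy the whole-number condition described in this issue, the sketch below finds the nearest multiples on either side. The helper `nearest_multiples` is hypothetical and does not call get_partitions; it only assumes that divisibility is the relevant requirement.

```python
from typing import Tuple


def nearest_multiples(num_canonical_nodes: int, num_physical_nodes: int) -> Tuple[int, int]:
    """Return the closest multiples of num_physical_nodes at or below, and at or above, the value."""
    if num_physical_nodes <= 0:
        raise ValueError("num_physical_nodes must be positive")
    lower = num_canonical_nodes // num_physical_nodes * num_physical_nodes
    # If the value is already a multiple, both bounds are the value itself.
    upper = lower if lower == num_canonical_nodes else lower + num_physical_nodes
    return lower, upper


if __name__ == "__main__":
    print(128 % 6)                    # remainder that breaks the whole-number condition
    print(nearest_multiples(128, 6))  # candidate values on either side of 128
```

Under that assumption, passing num_canonical_nodes=126 or 132 explicitly would be a workaround for the 6-node case until the default changes.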
