Default num_canonical_nodes to an even multiple of num_physical_nodes #215
@karan6181 that would improve things, but I'm also wondering whether my approach of deriving a default for canonical nodes would be better than hardcoding 128? It's not really clear to me what the parameter does, beyond that a high number is important for the improved algorithm for some reason. examples/examples/common/text_data.py Line 61 in 132ec02
Canonical nodes is the number of nodes you partition the sample space over. This stays the same even if your number of physical nodes changes; it is used to create an elastically deterministic sample order. Your samples get laid out according to canonical nodes and then folded onto physical nodes, so the two have to be even multiples of each other. Otherwise you would get weird interleaving/striping of shards across nodes, which would result in all shards being downloaded to all nodes — very bad and non-obvious. To see the impact of various changes in parameters to
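As a toy illustration of the fold described above (a hypothetical sketch, not the library's actual `get_partitions` implementation), samples can be partitioned over canonical nodes and then grouped onto physical nodes like this:

```python
def partition(num_samples, num_canonical_nodes, num_physical_nodes):
    """Toy sketch: lay samples out over canonical nodes, then fold
    canonical partitions onto physical nodes (hypothetical helper,
    not the real streaming get_partitions)."""
    if num_canonical_nodes % num_physical_nodes:
        raise ValueError('num_canonical_nodes must be an even multiple '
                         'of num_physical_nodes')
    # 1. Split the sample IDs into num_canonical_nodes contiguous ranges.
    per_canonical = num_samples // num_canonical_nodes
    canonical = [list(range(i * per_canonical, (i + 1) * per_canonical))
                 for i in range(num_canonical_nodes)]
    # 2. Fold: each physical node owns a contiguous block of canonical
    #    partitions, so its samples (and hence its shards) stay contiguous.
    ratio = num_canonical_nodes // num_physical_nodes
    return [sum(canonical[p * ratio:(p + 1) * ratio], [])
            for p in range(num_physical_nodes)]

# With 4 canonical nodes, the global sample order is identical whether
# you run on 2 or 4 physical nodes — only the grouping changes.
print(partition(16, 4, 2))  # two nodes, each gets 2 canonical partitions
print(partition(16, 4, 4))  # four nodes, each gets 1 canonical partition
```

When the ratio is not a whole number, a canonical partition would have to be split across physical nodes, producing the shard striping described above.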
Not sure of the problematic math, but `get_partitions` will error out if `num_canonical_nodes / num_physical_nodes` is not a whole number. This could be resolved by making the default conditional, i.e. dependent on the number of physical nodes.

Example I saw when attempting to train a 350M GPT example on 6 nodes:
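One way the conditional default could look (a sketch under the assumption that the physical node count is known at construction time; the helper name is hypothetical) is to round the hardcoded 128 up to the nearest multiple of the physical node count:

```python
def default_canonical_nodes(num_physical_nodes, target=128):
    """Hypothetical helper: smallest multiple of num_physical_nodes
    that is >= target, so num_canonical_nodes is always an even
    multiple of num_physical_nodes."""
    # Ceiling division via negation, then scale back up.
    return -(-target // num_physical_nodes) * num_physical_nodes

print(default_canonical_nodes(6))  # 6 nodes: 132 instead of 128
print(default_canonical_nodes(8))  # 8 nodes: 128, unchanged
```

For 6 nodes this yields 132 rather than erroring out, while power-of-two node counts keep the existing default of 128.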