Default num_canonical_nodes to an even multiple of num_physical_nodes #215

Open
micimize opened this issue Mar 8, 2023 · 3 comments
micimize commented Mar 8, 2023

I'm not sure which part of the math is at fault, but get_partitions errors out if num_canonical_nodes / num_physical_nodes is not a whole number. This could be resolved by making the default conditional, i.e.

pn = num_physical_nodes
# Fall back to the smallest multiple of pn above 120 when no value is passed.
num_canonical_nodes = num_canonical_nodes or (120 // pn * pn + pn)
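
To make the proposed default concrete, here is a minimal sketch of it as a standalone function. The helper name `default_canonical_nodes` and the `floor` parameter are hypothetical and only for illustration; nothing here is taken from the streaming codebase.

```python
from typing import Optional


def default_canonical_nodes(num_physical_nodes: int,
                            num_canonical_nodes: Optional[int] = None,
                            floor: int = 120) -> int:
    """Hypothetical helper: pick a canonical node count divisible by the physical one.

    If the caller passes an explicit value, it is returned unchanged.
    Otherwise the result is the smallest multiple of num_physical_nodes
    that is strictly greater than `floor`.
    """
    if num_physical_nodes <= 0:
        raise ValueError("num_physical_nodes must be positive")
    if num_canonical_nodes:
        return num_canonical_nodes
    pn = num_physical_nodes
    return floor // pn * pn + pn


if __name__ == "__main__":
    for pn in (1, 4, 6, 8, 16):
        n = default_canonical_nodes(pn)
        print(pn, n, n % pn == 0)
```

With 8 physical nodes this keeps the current value of 128, and with 6 it picks a nearby multiple of 6 instead.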

Example I saw when attempting to train a 350M gpt example on 6 nodes:

get_partitions(
    num_samples=364672,
    num_canonical_nodes=128,
    num_physical_nodes=6,
    ranks_per_node=4,
    workers_per_rank=1,
    batch_size=6
)
# => ValueError: cannot reshape array of size 364672 into shape (6)
@karan6181
Contributor

Thanks @micimize for raising this. The error originates in the streaming repository, and its message is not descriptive enough to tell the user what the actual problem is. The streaming repository will fix this with a better error message in the upcoming release.
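
For illustration only, a more descriptive up-front check could look something like the sketch below. The function name `check_node_counts` and the message wording are hypothetical, and the condition assumes the constraint is exactly the one described in this issue (canonical nodes divisible by physical nodes); it is not the actual streaming implementation.

```python
def check_node_counts(num_canonical_nodes: int, num_physical_nodes: int) -> None:
    """Hypothetical pre-check that fails early with an actionable message."""
    if num_canonical_nodes <= 0 or num_physical_nodes <= 0:
        raise ValueError("node counts must be positive integers")
    if num_canonical_nodes % num_physical_nodes != 0:
        raise ValueError(
            f"num_canonical_nodes ({num_canonical_nodes}) must be a whole "
            f"multiple of num_physical_nodes ({num_physical_nodes}); "
            f"got a remainder of {num_canonical_nodes % num_physical_nodes}."
        )


if __name__ == "__main__":
    check_node_counts(126, 6)  # passes silently
    try:
        check_node_counts(128, 6)  # the configuration from this issue
    except ValueError as err:
        print(err)
```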

Author

micimize commented Mar 9, 2023

@karan6181 that would improve things, but I'm also wondering whether my approach to defaulting canonical nodes would be better than the hardcoded 128. It's not really clear to me what the parameter does beyond that a high number is important for the improved algorithm for some reason.

num_canonical_nodes: Optional[int] = 128,

Contributor

knighton commented Mar 9, 2023

> It's not really clear to me what the parameter does beyond that a high number is important for the improved algorithm for some reason

Canonical nodes is the number of nodes the sample space is partitioned over. It stays the same even if the number of physical nodes changes, which is what makes the sample order elastically deterministic.

Your samples get laid out according to canonical nodes and then folded over onto physical nodes, so the two counts have to be an even multiple of each other. Otherwise you would get weird interleaving/striping of shards across nodes, and every shard would end up being downloaded to every node, which is very bad and non-obvious.
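
As a rough illustration of the folding described above, here is a toy model, not the actual get_partitions algorithm. It assumes samples are laid out contiguously per canonical node and then split into equal contiguous blocks per physical node; the function names are made up for this sketch.

```python
from typing import List


def canonical_nodes_touched(num_canonical: int, num_physical: int,
                            samples_per_canonical: int) -> List[List[int]]:
    """Toy model: list the canonical nodes each physical node reads from."""
    total = num_canonical * samples_per_canonical
    if total % num_physical:
        raise ValueError("total samples must split evenly over physical nodes")
    per_physical = total // num_physical
    touched = []
    for p in range(num_physical):
        start = p * per_physical              # first sample index for node p
        stop = start + per_physical           # one past the last sample index
        first = start // samples_per_canonical
        last = (stop - 1) // samples_per_canonical
        touched.append(list(range(first, last + 1)))
    return touched


def shared_canonical_nodes(touched: List[List[int]]) -> List[int]:
    """Canonical nodes that are read by more than one physical node."""
    counts = {}
    for nodes in touched:
        for c in nodes:
            counts[c] = counts.get(c, 0) + 1
    return sorted(c for c, n in counts.items() if n > 1)


if __name__ == "__main__":
    even = canonical_nodes_touched(8, 4, 3)    # 8 is a multiple of 4
    uneven = canonical_nodes_touched(8, 3, 3)  # 8 is not a multiple of 3
    print(even, shared_canonical_nodes(even))
    print(uneven, shared_canonical_nodes(uneven))
```

In this toy model, when the counts divide evenly each physical node owns whole canonical nodes. When they do not, some canonical nodes straddle two physical nodes, so both would need those shards.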

To see the impact of various changes in parameters to get_partitions, you can visualize it using this script:

git clone https://github.com/mosaicml/streaming/
cd streaming/
pip3 install --user -e ".[dev]"
make web &
open http://localhost:1337/
