[BUG] NUM_GPUS_PER_NODE not respected in inference #30

Open
averageFOSSenjoyer opened this issue Jan 25, 2025 · 0 comments

Labels
bug Something isn't working
Describe the bug
When calculating the communication bandwidth with TP, get_latency_fwd_per_tp_comm always defaults to the intra-node bandwidth regardless of tp_size, and get_latency_fwd_per_layer_shared_dp_comm depends on a magic number 8, which I assume refers to NUM_GPUS_PER_NODE.

# assuming tp_size <= number of GPUs per node, thus using intra-node bandwidth
latency_per_all_reduce = (elems_per_all_reduce * dtype_bytes /
                          (self.get_intra_node_bandwidth() * 10**9))

# assuming tp and dp are preferred when sharding intra node, pp is only applied across nodes
# when (dp_size * tp_size) <= 8, the data parallel processes are within a node
bandwidth = self.get_intra_node_bandwidth() if (
    dp_size * tp_size) <= 8 else self.get_inter_node_bandwidth()

NUM_GPUS_PER_NODE = 8  # number of GPUs per node
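
To illustrate the impact (the sizes and bandwidths below are assumptions for illustration, not values taken from llm_analysis): when tp_size exceeds the number of GPUs per node, the all-reduce actually crosses the node boundary, so assuming intra-node bandwidth can underestimate the latency by an order of magnitude.

# illustrative only: hypothetical sizes and bandwidths, not values from llm_analysis
elems_per_all_reduce = 2 * 16384 * 2048   # e.g. 2 * hidden_dim * tokens
dtype_bytes = 2                           # fp16 / bf16
intra_node_bw_GBps = 300                  # assumed NVLink bandwidth
inter_node_bw_GBps = 25                   # assumed cross-node network bandwidth
latency_intra = elems_per_all_reduce * dtype_bytes / (intra_node_bw_GBps * 10**9)
latency_inter = elems_per_all_reduce * dtype_bytes / (inter_node_bw_GBps * 10**9)
# latency_inter / latency_intra == 12, so defaulting to intra-node bandwidth
# when tp_size > NUM_GPUS_PER_NODE underestimates the all-reduce latency ~12x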

Expected behavior
For get_latency_fwd_per_tp_comm, it should use get_intra_node_bandwidth when tp_size <= NUM_GPUS_PER_NODE and get_inter_node_bandwidth otherwise.

For get_latency_fwd_per_layer_shared_dp_comm, the magic number 8 should be replaced with NUM_GPUS_PER_NODE.
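
A minimal sketch of what both call sites could look like after the change (comm_bw is a name I introduce here; get_intra_node_bandwidth, get_inter_node_bandwidth, and NUM_GPUS_PER_NODE are the existing names from the snippets above):

# in get_latency_fwd_per_tp_comm: choose the bandwidth based on tp_size
comm_bw = (self.get_intra_node_bandwidth()
           if tp_size <= NUM_GPUS_PER_NODE
           else self.get_inter_node_bandwidth())
latency_per_all_reduce = (elems_per_all_reduce * dtype_bytes /
                          (comm_bw * 10**9))

# in get_latency_fwd_per_layer_shared_dp_comm: replace the magic number 8
bandwidth = (self.get_intra_node_bandwidth()
             if (dp_size * tp_size) <= NUM_GPUS_PER_NODE
             else self.get_inter_node_bandwidth())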

Looking at training, tp_size <= NUM_GPUS_PER_NODE seems to be enforced rather than merely suggested; should the same check be applied for infer as well?

assert tp_size <= num_gpus_per_node, (
    f"tp_size must be <= {num_gpus_per_node}(num_gpus_per_node), tensor"
    " parallelism requires high communication bandwidth to be efficient"
    " and is best kept within a single node where high bandwidth NVLink"
    " is available.")

Additional context
I'd be more than happy to provide a PR if the report is valid.

Minor Issue
A default for mlp_gated_linear_units is never set when the code hits the first if but misses the second, leaving the variable unassigned. This can be reproduced with:

python3 -m llm_analysis.analysis infer -m meta-llama/Llama-3.1-405b

if ffn_embed_dim:
    expansion_ratio = ffn_embed_dim / hidden_dim
    if expansion_ratio == 3.5:
        mlp_gated_linear_units = True
else:
    mlp_gated_linear_units = False
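
One possible fix, sketched under the assumption that the surrounding code otherwise stays the same, is to assign a default before the branch so the variable is always bound:

# sketch of a possible fix: assign a default up front so mlp_gated_linear_units
# is bound even when ffn_embed_dim is set but expansion_ratio != 3.5
mlp_gated_linear_units = False
if ffn_embed_dim:
    expansion_ratio = ffn_embed_dim / hidden_dim
    if expansion_ratio == 3.5:
        mlp_gated_linear_units = True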
