Describe the bug
When calculating the communication bandwidth in get_latency_fwd_per_tp_comm and get_latency_fwd_per_layer_shared_dp_comm, the former always defaults to the intra-node bandwidth, and the latter depends on a magic number 8, which I assume refers to NUM_GPUS_PER_NODE.
llm-analysis/llm_analysis/analysis.py, lines 1221 to 1223 in d841e40
llm-analysis/llm_analysis/analysis.py, lines 1247 to 1250 in d841e40
llm-analysis/llm_analysis/constant.py, line 37 in d841e40
Expected behavior
For get_latency_fwd_per_tp_comm, it should use get_intra_node_bandwidth when tp_size <= NUM_GPUS_PER_NODE and get_inter_node_bandwidth otherwise.
For get_latency_fwd_per_layer_shared_dp_comm, the magic number 8 should be replaced with NUM_GPUS_PER_NODE.
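A minimal sketch of both proposed changes, assuming get_intra_node_bandwidth and get_inter_node_bandwidth are no-argument methods on the same analysis object as in the linked code, and that NUM_GPUS_PER_NODE is importable from llm_analysis.constant (the constant linked above); the surrounding latency math is deliberately elided:

```python
from llm_analysis.constant import NUM_GPUS_PER_NODE

def select_tp_comm_bandwidth(analysis, tp_size: int) -> float:
    """Hypothetical helper: pick the link that the TP all-reduce actually crosses."""
    if tp_size <= NUM_GPUS_PER_NODE:
        # TP group fits inside one node, so the intra-node bandwidth applies.
        return analysis.get_intra_node_bandwidth()
    # TP group spans nodes, so the slower inter-node link is the bottleneck.
    return analysis.get_inter_node_bandwidth()

# In get_latency_fwd_per_layer_shared_dp_comm, the hard-coded 8 would simply
# become NUM_GPUS_PER_NODE wherever it stands for the per-node GPU count.
```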
Looking at training, the tp_size <= NUM_GPUS_PER_NODE condition seems more like an enforcement than a suggestion; should it be checked for infer as well?
llm-analysis/llm_analysis/analysis.py, lines 2695 to 2699 in d841e40
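If so, a hypothetical guard for infer mirroring the training-side enforcement could look like the following; the function name, message, and placement are made up for illustration:

```python
from llm_analysis.constant import NUM_GPUS_PER_NODE

def check_tp_size_for_infer(tp_size: int) -> None:
    # Hypothetical check mirroring what training enforces at the lines linked
    # above: tensor parallel size should not exceed the GPUs in one node.
    assert tp_size <= NUM_GPUS_PER_NODE, (
        f"tp_size ({tp_size}) must not exceed NUM_GPUS_PER_NODE "
        f"({NUM_GPUS_PER_NODE})"
    )
```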
Additional context
I'd be more than happy to provide a PR if the report is valid.
Minor Issue
A default for mlp_gated_linear_units is not set when the config hits the first if but misses the second. This can be reproduced with:
python3 -m llm_analysis.analysis infer -m meta-llama/Llama-3.1-405b
llm-analysis/llm_analysis/config.py, lines 216 to 221 in d841e40
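A toy reduction of the failure mode (the branch conditions below are invented; the real ones are at the config.py lines linked above): a config that takes only the first branch ends up with no mlp_gated_linear_units value at all, which an explicit fallback default would prevent.

```python
# Toy illustration only -- not the actual config.py code.
def resolve(hf_config: dict) -> dict:
    resolved = {}
    if "num_key_value_heads" in hf_config:         # invented first if: taken
        resolved["num_key_value_heads"] = hf_config["num_key_value_heads"]
    if "mlp_gated_linear_units" in hf_config:      # invented second if: missed
        resolved["mlp_gated_linear_units"] = hf_config["mlp_gated_linear_units"]
    # possible fix: resolved.setdefault("mlp_gated_linear_units", False)
    return resolved

print(resolve({"num_key_value_heads": 8}))
# {'num_key_value_heads': 8} -- mlp_gated_linear_units is missing entirely
```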