Refactor llm perf backend handling #258
Closed
Changes from all commits (79 commits)
92a5cea  add intel pytorch ort and openvino to leaderboard
0168063  add intel pytorch ort and openvino to leaderboard
baptistecolle 0bc416f  Add support for intel in leaderboard
baptistecolle 85f62e6  Update update_llm_perf_intel_pytorch.yml
baptistecolle 7151e01  Update update_llm_perf_intel_pytorch.yml
baptistecolle 4afc529  Merge branch 'add-intel-hardware-to-leaderboard' into intel-leaderboard
baptistecolle c92f818  add new llm_perf_tests
baptistecolle c31e6cf  fix workflow
baptistecolle d406440  fix failing tests
baptistecolle 20b96b2  fix failing tests
baptistecolle c7e0ec0  fix failing tests
baptistecolle 6d7bf69  fix failing tests
baptistecolle 7048df5  refractoring
baptistecolle db88b2a  intel with multiple backends
baptistecolle 1246d28  parallelize intel llm-perf
baptistecolle 2d6830e  parallelize intel llm-perf
baptistecolle 801c5bf  parallelize intel llm-perf
baptistecolle 2e9526c  parallelize intel llm-perf
baptistecolle 6d87d31  parallelize intel llm-perf
baptistecolle 62266a6  parallelize intel llm-perf
baptistecolle 0a39667  parallelize intel llm-perf
baptistecolle caf7b67  parallelize intel llm-perf
baptistecolle 5890457  parallelize intel llm-perf
baptistecolle 50bd1a2  update leaderboard collection to support more hardware
baptistecolle f93cc7c  update leaderboard collection to support more hardware
baptistecolle 2fad593  update leaderboard collection to support more hardware
baptistecolle 6f2885c  update leaderboard collection to support more hardware
baptistecolle a59e554  update leaderboard collection to support more hardware
baptistecolle a193748  update leaderboard collection to support more hardware
baptistecolle 31f1ff6  update leaderboard collection to support more hardware
baptistecolle 9d88f5a  Merge branch 'main' into intel-leaderboard
IlyasMoutawwakil 0f041fb  add new workflow
baptistecolle b2330b0  add new workflow
baptistecolle 2f54e2d  add new workflow
baptistecolle 8603b68  add new workflow
baptistecolle ec829cb  add new workflow
baptistecolle d152c81  add new workflow
baptistecolle 9730b0b  add new workflow
baptistecolle 6e9d33c  add new workflow
baptistecolle 540af0a  add new workflow
baptistecolle 452e4b0  add new workflow
baptistecolle b25d6e1  add new workflow
baptistecolle a76e56d  add new workflow
baptistecolle 6677def  add new workflow
baptistecolle 6593487  add new workflow
baptistecolle 9802c95  add new workflow
baptistecolle b6b947f  add new workflow
baptistecolle 7a891c1  add new workflow
baptistecolle a6f289b  remove intel reference
baptistecolle e97ee56  remove intel reference
baptistecolle f5f0eeb  remove intel reference
baptistecolle 55e2c69  refractoring done
baptistecolle ae7b939  refractoring done
baptistecolle 5c80cad  refractoring done
baptistecolle e75a361  refractoring done
baptistecolle 07d1d32  refractoring done
baptistecolle 34f958f  refractoring done
baptistecolle 35dc1cf  refractoring done
baptistecolle 9348515  refractoring done
baptistecolle 8b28005  refractoring done
baptistecolle 7cb3ea0  remove push on workflow used for debugging
baptistecolle e20ac80  Merge branch 'main' into refactor-llm-perf-backend-handling
baptistecolle c4c8887  refractor pytorch cpu
baptistecolle 32626f9  refractor pytorch cpu
baptistecolle 99a00df  refractor pytorch cpu
baptistecolle b27f806  fix failling workflow
baptistecolle 10c47ea  fix broken canonical list
baptistecolle 60aa33e  fix broken canonical list
baptistecolle 842645e  Merge branch 'fix-broken-canonical-list' into refactor-llm-perf-backe…
baptistecolle d0804a7  Merge branch 'main' into refactor-llm-perf-backend-handling
baptistecolle f3bc069  merge main
baptistecolle 602a9d0  merge main into branch
baptistecolle b2d5f12  merge main into branch
baptistecolle 08f70e2  merge main into branch
baptistecolle ab1710a  merge main into branch
baptistecolle 2512827  add new label system
baptistecolle defc78a  add new label system
baptistecolle 89b6a97  add new chnages from review
baptistecolle 3130c87  add new chnages from review
@@ -0,0 +1,92 @@
from itertools import product
from typing import Any, Dict, List

from llm_perf.common.benchmark_runner import LLMPerfBenchmarkManager
from llm_perf.common.utils import CANONICAL_PRETRAINED_OPEN_LLM_LIST, GENERATE_KWARGS, INPUT_SHAPES
from optimum_benchmark import PyTorchConfig
from optimum_benchmark.benchmark.config import BenchmarkConfig
from optimum_benchmark.launchers.process.config import ProcessConfig
from optimum_benchmark.scenarios.inference.config import InferenceConfig


class CPUPyTorchBenchmarkRunner(LLMPerfBenchmarkManager):
    def __init__(self):
        super().__init__(backend="pytorch", device="cpu")

        self.attention_configs = self._get_attention_configs()
        assert self.subset is not None, "SUBSET environment variable must be set for benchmarking"
        self.weights_configs = self._get_weights_configs(self.subset)

    def get_list_of_benchmarks_to_run(self) -> List[Dict[str, Any]]:
        return [
            {"model": model, "attn_implementation": attn_impl, "weights_config": weights_cfg}
            for model, attn_impl, weights_cfg in product(
                CANONICAL_PRETRAINED_OPEN_LLM_LIST, self.attention_configs, self.weights_configs.keys()
            )
        ]

    def get_benchmark_name(self, model: str, **kwargs) -> str:
        weights_config = kwargs["weights_config"]
        attn_implementation = kwargs["attn_implementation"]
        return f"{model}-{weights_config}-{attn_implementation}"

    def get_benchmark_config(self, model: str, **kwargs) -> BenchmarkConfig:
        weights_config = kwargs["weights_config"]
        attn_implementation = kwargs["attn_implementation"]

        assert (
            weights_config in self.weights_configs
        ), f"your config does not contain {weights_config}, adjust your _get_weights_configs to fix this issue"

        torch_dtype = self.weights_configs[weights_config]["torch_dtype"]
        quant_scheme = self.weights_configs[weights_config]["quant_scheme"]
        quant_config = self.weights_configs[weights_config]["quant_config"]

        launcher_config = ProcessConfig()
        scenario_config = InferenceConfig(
            memory=True,
            energy=True,
            latency=True,
            duration=10,
            iterations=10,
            warmup_runs=10,
            input_shapes=INPUT_SHAPES,
            generate_kwargs=GENERATE_KWARGS,
        )
        backend_config = PyTorchConfig(
            model=model,
            device="cpu",
            no_weights=True,
            library="transformers",
            task="text-generation",
            torch_dtype=torch_dtype,
            quantization_scheme=quant_scheme,
            quantization_config=quant_config,
            attn_implementation=attn_implementation,
            model_kwargs={"trust_remote_code": True},
        )

        return BenchmarkConfig(
            name=f"{weights_config}-{attn_implementation}",
            scenario=scenario_config,
            launcher=launcher_config,
            backend=backend_config,
        )

    def _get_weights_configs(self, subset) -> Dict[str, Dict[str, Any]]:
        if subset == "unquantized":
            return {
                "float32": {"torch_dtype": "float32", "quant_scheme": None, "quant_config": {}},
                "float16": {"torch_dtype": "float16", "quant_scheme": None, "quant_config": {}},
                "bfloat16": {"torch_dtype": "bfloat16", "quant_scheme": None, "quant_config": {}},
            }
        else:
            raise ValueError(f"Unknown subset: {subset}")

    def _get_attention_configs(self) -> List[str]:
        return ["eager", "sdpa"]


if __name__ == "__main__":
    runner = CPUPyTorchBenchmarkRunner()
    runner.run_benchmarks()
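
For context: the runner above only supplies the CPU-specific pieces; the shared driver it subclasses, LLMPerfBenchmarkManager from llm_perf/common/benchmark_runner.py, is not part of this diff. Below is a minimal sketch of what that interface presumably looks like, inferred from how the subclasses use it. The method names, the backend/device constructor arguments, and the subset attribute are taken from this PR; the bodies (reading SUBSET from the environment, the run loop, the default is_benchmark_supported) are assumptions, not the actual implementation.

# Hypothetical sketch of the shared runner interface, inferred from the subclasses in this PR.
# The real class lives in llm_perf/common/benchmark_runner.py; the bodies below are guesses.
import os
from abc import ABC, abstractmethod
from typing import Any, Dict, List

from optimum_benchmark.benchmark.config import BenchmarkConfig


class LLMPerfBenchmarkManager(ABC):
    def __init__(self, backend: str, device: str):
        self.backend = backend
        self.device = device
        # Subclasses assert that self.subset is set, so it is presumably read from the environment.
        self.subset = os.environ.get("SUBSET")

    @abstractmethod
    def get_list_of_benchmarks_to_run(self) -> List[Dict[str, Any]]: ...

    @abstractmethod
    def get_benchmark_name(self, model: str, **kwargs) -> str: ...

    @abstractmethod
    def get_benchmark_config(self, model: str, **kwargs) -> BenchmarkConfig: ...

    def is_benchmark_supported(self, **kwargs) -> bool:
        # Default: everything runs; the CUDA runner overrides this to skip invalid combinations.
        return True

    def run_benchmarks(self) -> None:
        for kwargs in self.get_list_of_benchmarks_to_run():
            if not self.is_benchmark_supported(**kwargs):
                continue
            name = self.get_benchmark_name(**kwargs)
            config = self.get_benchmark_config(**kwargs)
            # The real implementation presumably launches the benchmark and pushes the
            # results to the leaderboard dataset; that part is omitted here.
            print(f"Would run benchmark: {name} ({config.name})")

The point of the refactor is visible in this split: each hardware/backend runner only declares its grid of configurations and how to turn one grid point into a BenchmarkConfig, while filtering and execution stay in the common base class.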
llm_perf/benchmark_runners/update_llm_perf_cuda_pytorch.py (147 additions, 0 deletions)
@@ -0,0 +1,147 @@
from itertools import product
from typing import Any, Dict, List

from llm_perf.common.benchmark_runner import LLMPerfBenchmarkManager
from llm_perf.common.utils import CANONICAL_PRETRAINED_OPEN_LLM_LIST, GENERATE_KWARGS, INPUT_SHAPES
from optimum_benchmark import PyTorchConfig
from optimum_benchmark.benchmark.config import BenchmarkConfig
from optimum_benchmark.launchers.process.config import ProcessConfig
from optimum_benchmark.scenarios.inference.config import InferenceConfig


class CUDAPyTorchBenchmarkRunner(LLMPerfBenchmarkManager):
    def __init__(self):
        super().__init__(backend="pytorch", device="cuda")

        self.attention_configs = self._get_attention_configs()
        assert self.subset is not None, "SUBSET environment variable must be set for benchmarking"
        self.weights_configs = self._get_weights_configs(self.subset)

    def get_list_of_benchmarks_to_run(self) -> List[Dict[str, Any]]:
        return [
            {"model": model, "attn_implementation": attn_impl, "weights_config": weights_cfg}
            for model, attn_impl, weights_cfg in product(
                CANONICAL_PRETRAINED_OPEN_LLM_LIST, self.attention_configs, self.weights_configs.keys()
            )
        ]

    def get_benchmark_name(self, model: str, **kwargs) -> str:
        weights_config = kwargs["weights_config"]
        attn_implementation = kwargs["attn_implementation"]
        return f"{model}-{weights_config}-{attn_implementation}"

    def is_benchmark_supported(self, **kwargs) -> bool:
        # flash_attention_2 does not support float32
        if kwargs["attn_implementation"] == "flash_attention_2" and kwargs["weights_config"] == "float32":
            return False
        return True

    def get_benchmark_config(self, model: str, **kwargs) -> BenchmarkConfig:
        weights_config = kwargs["weights_config"]
        attn_implementation = kwargs["attn_implementation"]

        assert (
            weights_config in self.weights_configs
        ), f"your config does not contain {weights_config}, adjust your _get_weights_configs to fix this issue"

        torch_dtype = self.weights_configs[weights_config]["torch_dtype"]
        quant_scheme = self.weights_configs[weights_config]["quant_scheme"]
        quant_config = self.weights_configs[weights_config]["quant_config"]

        launcher_config = ProcessConfig(device_isolation=True, device_isolation_action="kill")
        scenario_config = InferenceConfig(
            memory=True,
            energy=True,
            latency=True,
            duration=10,
            iterations=10,
            warmup_runs=10,
            input_shapes=INPUT_SHAPES,
            generate_kwargs=GENERATE_KWARGS,
        )
        backend_config = PyTorchConfig(
            model=model,
            device="cuda",
            device_ids="0",
            no_weights=True,
            library="transformers",
            task="text-generation",
            torch_dtype=torch_dtype,
            quantization_scheme=quant_scheme,
            quantization_config=quant_config,
            attn_implementation=attn_implementation,
            model_kwargs={"trust_remote_code": True},
        )

        return BenchmarkConfig(
            name=f"{weights_config}-{attn_implementation}",
            scenario=scenario_config,
            launcher=launcher_config,
            backend=backend_config,
        )

    def _get_weights_configs(self, subset) -> Dict[str, Dict[str, Any]]:
        if subset == "unquantized":
            return {
                "float32": {"torch_dtype": "float32", "quant_scheme": None, "quant_config": {}},
                "float16": {"torch_dtype": "float16", "quant_scheme": None, "quant_config": {}},
                "bfloat16": {"torch_dtype": "bfloat16", "quant_scheme": None, "quant_config": {}},
            }
        elif subset == "bnb":
            return {
                "4bit-bnb": {"torch_dtype": "float16", "quant_scheme": "bnb", "quant_config": {"load_in_4bit": True}},
                "8bit-bnb": {"torch_dtype": "float16", "quant_scheme": "bnb", "quant_config": {"load_in_8bit": True}},
            }
        elif subset == "gptq":
            return {
                "4bit-gptq-exllama-v1": {
                    "torch_dtype": "float16",
                    "quant_scheme": "gptq",
                    "quant_config": {"bits": 4, "use_exllama": True, "version": 1, "model_seqlen": 256},
                },
                "4bit-gptq-exllama-v2": {
                    "torch_dtype": "float16",
                    "quant_scheme": "gptq",
                    "quant_config": {"bits": 4, "use_exllama": True, "version": 2, "model_seqlen": 256},
                },
            }
        elif subset == "awq":
            return {
                "4bit-awq-gemm": {
                    "torch_dtype": "float16",
                    "quant_scheme": "awq",
                    "quant_config": {"bits": 4, "version": "gemm"},
                },
                "4bit-awq-gemv": {
                    "torch_dtype": "float16",
                    "quant_scheme": "awq",
                    "quant_config": {"bits": 4, "version": "gemv"},
                },
                "4bit-awq-exllama-v1": {
                    "torch_dtype": "float16",
                    "quant_scheme": "awq",
                    "quant_config": {
                        "bits": 4,
                        "version": "exllama",
                        "exllama_config": {"version": 1, "max_input_len": 64, "max_batch_size": 1},
                    },
                },
                "4bit-awq-exllama-v2": {
                    "torch_dtype": "float16",
                    "quant_scheme": "awq",
                    "quant_config": {
                        "bits": 4,
                        "version": "exllama",
                        "exllama_config": {"version": 2, "max_input_len": 64, "max_batch_size": 1},
                    },
                },
            }
        else:
            raise ValueError(f"Unknown subset: {subset}")

    def _get_attention_configs(self) -> List[str]:
        return ["eager", "sdpa", "flash_attention_2"]


if __name__ == "__main__":
    runner = CUDAPyTorchBenchmarkRunner()
    runner.run_benchmarks()
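
Both runners are driven the same way. A hedged usage example follows; the subset value and the direct Python invocation are illustrative only, since in practice the GitHub workflows presumably export SUBSET and run the module as a script.

# Illustrative invocation of the CUDA runner; module path and class name come from this PR,
# the chosen subset value is just an example.
import os

from llm_perf.benchmark_runners.update_llm_perf_cuda_pytorch import CUDAPyTorchBenchmarkRunner

# The runner reads the subset from the environment in its constructor.
# The CUDA runner accepts "unquantized", "bnb", "gptq" or "awq"; the CPU runner only knows "unquantized".
os.environ["SUBSET"] = "awq"

runner = CUDAPyTorchBenchmarkRunner()

# Grid = canonical models x {eager, sdpa, flash_attention_2} x the selected weight configs.
# For the unquantized subset, flash_attention_2 with float32 is skipped by is_benchmark_supported.
benchmarks = runner.get_list_of_benchmarks_to_run()
print(f"{len(benchmarks)} candidate benchmark configurations")

runner.run_benchmarks()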
Empty file.
Review comment: you can probably add more specifications here to be able to run specific benchmarks, like cuda/cpu. I didn't try it, but you might also be able to add conditions on the matrix arguments, like || contains(github.event.pull_request.labels.*.name, matrix.subset)}}, to run specific subsets or specific machines.