Add llama.cpp backend #231
Conversation
added llama.cpp backend
Thanks a lot! Very awesome work!
Only missing test configs and GitHub workflows 🤗
Hopefully I'll have fixed the process launcher with MPS by the time the MPS workflows start running.
Very clean PR!
I left a couple of comments.
Additionally, since benchmarking for llama.cpp is limited to a batch size of 1, it may be good to add a comment right above the batch size field in the three example configs, unless an error or warning is already raised somewhere in the code and I missed it.
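For example, something along these lines could fail fast (just a hypothetical sketch, not code from this PR; the function name and the `input_shapes` dict are made up):

```python
# Hypothetical sketch, not the PR's actual code: fail fast on unsupported batch sizes
# before a llama.cpp benchmark run is launched.

def validate_llama_cpp_input_shapes(input_shapes: dict) -> None:
    """Raise if the requested batch size cannot be benchmarked with llama.cpp."""
    batch_size = input_shapes.get("batch_size", 1)
    if batch_size != 1:
        raise ValueError(
            f"The llama.cpp backend only supports a batch size of 1, got {batch_size}."
        )


# This passes silently; a batch_size of, say, 4 would raise before any model is loaded.
validate_llama_cpp_input_shapes({"batch_size": 1, "sequence_length": 256})
```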
It is done here.
Thanks for the review. I implemented the necessary changes.
(Also, two of the runners are currently offline, so I am unable to run the CI on them.)
Let me know if you have more remarks.
One or two examples are enough; there's a lot of repetition there.
Indeed, I created multiple configs during development and forgot to remove them from the PR. I fixed it now.
Great PR @baptistecolle 🤗
Add llama.cpp as Backend for Optimum Benchmark
Overview
This PR introduces `llama.cpp` as a backend for the Optimum benchmark (see issue #117).

Changes

- Added example configs in the `examples` folder demonstrating how to run the `llama.cpp` backend (a rough sketch of the equivalent Python usage is included below).

Current limitations:

- Benchmarking with the `llama.cpp` backend is limited to a batch size of 1, due to the `llama-cpp-python` binding.
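As a rough illustration only (not copied from this PR), this is how the new backend might be invoked from optimum-benchmark's Python API. It assumes the backend config is exported as `LlamaCppConfig` and accepts a GGUF repo id plus `filename`; the model repo, file name, and field names below are illustrative and may differ from the merged code.

```python
# Minimal sketch, assuming the backend config is exposed as `LlamaCppConfig`;
# the GGUF repo id, filename, and field names are illustrative assumptions.
from optimum_benchmark import (
    Benchmark,
    BenchmarkConfig,
    InferenceConfig,
    LlamaCppConfig,
    ProcessConfig,
)
from optimum_benchmark.logging_utils import setup_logging

if __name__ == "__main__":
    setup_logging(level="INFO")

    benchmark_config = BenchmarkConfig(
        name="llama_cpp_text_generation",
        launcher=ProcessConfig(),  # run the benchmark in an isolated process
        scenario=InferenceConfig(latency=True, memory=True),
        backend=LlamaCppConfig(
            device="cpu",
            task="text-generation",
            model="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",  # illustrative GGUF repo
            filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # illustrative quantized file
        ),
    )

    # Launch the benchmark and log the collected latency/memory metrics.
    benchmark_report = Benchmark.launch(benchmark_config)
    benchmark_report.log()
```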
Performance
The metrics were validated by comparing the benchmark results of the PyTorch backend with those of the llama.cpp backend. The results are close; I can provide the full .json files if needed, but they are not included here directly because they are hard to read.
CLI output (tested on an M3 Pro CPU):
Performance with the llama.cpp backend
Performance with the PyTorch backend
The performance difference might be due to the significant amount of copying between devices with PyTorch.
Furthermore, llama.cpp is optimized for Mac, which could explain its higher performance. Let me know if you want me to investigate the performance difference further.