
Why llama.cpp runs substantially faster #17

Open
Ingvarstep opened this issue May 8, 2024 · 4 comments

@Ingvarstep

I am thinking about creating a DeBERTa version of this project. Initially I thought to use it as a backbone, because it's easier to modify than llama.cpp, but performance is really important for my case. The readme mentions that the llama.cpp implementation is substantially faster; I am a beginner with ggml and llama.cpp and I don't understand why. Can someone explain it?

@iamlemec
Owner

iamlemec commented May 8, 2024

Hello! I haven't done a head-to-head benchmark actually, it's mostly just based on brief testing and some theoretical hunches. But basically, llama.cpp packs tokens from multiple sequences into a single batch, keeping track of their sequence IDs and masking attention appropriately. Meanwhile, bert.cpp here pads every sequence to a common length. So in situations where your sequences are often shorter than your max sequence length, that padding is pretty wasteful. If that's not the case, you might end up with similar performance.
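
To make that concrete, here is a rough standalone C++ sketch (not actual llama.cpp or bert.cpp code, and the sequence lengths are made up) that counts how many token slots each approach processes and builds the block-diagonal attention mask that per-token sequence IDs imply:

```cpp
// Standalone illustration: padded batching (one sequence per row, padded to
// the longest) vs. packed batching (all tokens in one batch, each tagged with
// a sequence ID, attention masked to stay within its own sequence).
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Made-up batch of 4 sequences with uneven lengths.
    std::vector<int> seq_lens = {12, 87, 5, 30};

    // Padding approach: every sequence is padded up to the longest one.
    int max_len       = *std::max_element(seq_lens.begin(), seq_lens.end());
    int padded_tokens = max_len * (int) seq_lens.size();

    // Packing approach: tokens from all sequences share one batch, each
    // carrying the ID of the sequence it belongs to.
    int packed_tokens = std::accumulate(seq_lens.begin(), seq_lens.end(), 0);
    std::vector<int> seq_id;
    for (int s = 0; s < (int) seq_lens.size(); ++s)
        for (int t = 0; t < seq_lens[s]; ++t)
            seq_id.push_back(s);

    // Attention mask: token i may attend to token j only if both carry the
    // same sequence ID (block-diagonal; no causal mask for BERT-style models).
    std::vector<std::vector<bool>> mask(packed_tokens, std::vector<bool>(packed_tokens));
    for (int i = 0; i < packed_tokens; ++i)
        for (int j = 0; j < packed_tokens; ++j)
            mask[i][j] = (seq_id[i] == seq_id[j]);

    printf("padded slots: %d, packed slots: %d (%.0f%% of padded)\n",
           padded_tokens, packed_tokens, 100.0 * packed_tokens / padded_tokens);
    return 0;
}
```

With these made-up lengths the padded batch runs attention over 348 token slots while the packed batch only touches 134, which is roughly where the speed gap comes from when sequence lengths vary a lot.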

The other thing is just accuracy. It turns out that tokenization is really complicated, and llama.cpp is good at it and getting better over time, especially as new models arise. Overall, I would say that llama.cpp isn't too much harder to work with. Just look at build_bert in the llama.cpp main file; that's where all the computational content is. The rest is just adding to various enums.

@Ingvarstep
Author

Hello! Thank you very much for your explanation; it answers most of my questions.

Regarding tokenization, I want to build DeBERTa v3, which uses a sentencepiece tokenizer, so I plan to use the implementation from Google. Does llama.cpp have its own implementation of the sentencepiece tokenizer?

@iamlemec
Owner

Yup, llama.cpp has its own sentencepiece tokenizer. Though I'm not sure how well it works with embedding models, which have historically used more of a WordPiece tokenizer. I've been trying to get BAAI/bge-m3 to work for a while, which also uses sentencepiece, and there seem to be some subtle differences messing with the tokenization. That model is based on RoBERTa, so maybe (hopefully) the DeBERTa branch will work better.

Anyway, let me know if you need any help with the DeBERTa implementation!

@Ingvarstep
Author

Got it, thank you for the answer. Considering this, I think it is safer to use the Google implementation for now.

And thank you for offering to help with the DeBERTa implementation; I will get back to you if I have any questions.
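
For anyone who finds this later, the usage I have in mind with Google's sentencepiece C++ library looks roughly like the sketch below (the model path is just a placeholder for the DeBERTa v3 sentencepiece model file):

```cpp
// Rough sketch of tokenizing with Google's sentencepiece C++ library
// (link with -lsentencepiece). The model path is a placeholder.
#include <sentencepiece_processor.h>

#include <cstdio>
#include <vector>

int main() {
    sentencepiece::SentencePieceProcessor sp;
    const auto status = sp.Load("spm.model");  // placeholder path
    if (!status.ok()) {
        fprintf(stderr, "failed to load tokenizer: %s\n", status.ToString().c_str());
        return 1;
    }

    // Encode text into the integer token IDs the model expects.
    std::vector<int> ids;
    sp.Encode("hello world", &ids);
    for (int id : ids) printf("%d ", id);
    printf("\n");
    return 0;
}
```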
