
Why llama.cpp runs substantially faster #17

Open
Ingvarstep opened this issue May 8, 2024 · 4 comments

@Ingvarstep

I am thinking about creating a DeBERTa version of this project. Initially I thought to use it as a backbone, because it's easier to modify than llama.cpp, but performance is really important for my case. The readme mentions that the llama.cpp implementation is substantially faster; I am a beginner with ggml and llama.cpp and I don't understand why. Can someone explain it?

@iamlemec
Owner

iamlemec commented May 8, 2024

Hello! I haven't done a head-to-head benchmark actually, it's mostly just based on brief testing and some theoretical hunches. But basically, llama.cpp packs tokens from multiple sequences into a single batch, keeping track of their sequence IDs and masking attention appropriately. Meanwhile, bert.cpp here pads every sequence to a common length. So in situations where your sequences are often shorter than your max sequence length, that padding is pretty wasteful. If that's not the case, you might end up with similar performance.
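
To make that concrete, here is a rough standalone C++ sketch (not actual llama.cpp or bert.cpp code, and the sequence lengths are made up) that counts how many token slots each approach processes and builds the block-diagonal attention mask that per-token sequence IDs imply:

```cpp
// Standalone illustration: padded batching (one sequence per row, padded to
// the longest) vs. packed batching (all tokens in one batch, each tagged with
// a sequence ID, attention masked to stay within its own sequence).
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Made-up batch of 4 sequences with uneven lengths.
    std::vector<int> seq_lens = {12, 87, 5, 30};

    // Padding approach: every sequence is padded up to the longest one.
    int max_len       = *std::max_element(seq_lens.begin(), seq_lens.end());
    int padded_tokens = max_len * (int) seq_lens.size();

    // Packing approach: tokens from all sequences share one batch, each
    // carrying the ID of the sequence it belongs to.
    int packed_tokens = std::accumulate(seq_lens.begin(), seq_lens.end(), 0);
    std::vector<int> seq_id;
    for (int s = 0; s < (int) seq_lens.size(); ++s)
        for (int t = 0; t < seq_lens[s]; ++t)
            seq_id.push_back(s);

    // Attention mask: token i may attend to token j only if both carry the
    // same sequence ID (block-diagonal; no causal mask for BERT-style models).
    std::vector<std::vector<bool>> mask(packed_tokens, std::vector<bool>(packed_tokens));
    for (int i = 0; i < packed_tokens; ++i)
        for (int j = 0; j < packed_tokens; ++j)
            mask[i][j] = (seq_id[i] == seq_id[j]);

    printf("padded slots: %d, packed slots: %d (%.0f%% of padded)\n",
           padded_tokens, packed_tokens, 100.0 * packed_tokens / padded_tokens);
    return 0;
}
```

With these made-up lengths the padded batch runs attention over 348 token slots while the packed batch only touches 134, which is roughly where the speed gap comes from when sequence lengths vary a lot.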

The other thing is just accuracy. It turns out that tokenization is really complicated, and llama.cpp is good at it and getting better over time, especially as new models arise. Overall, I would say that llama.cpp isn't too much harder to work with. Just look at build_bert in the llama.cpp main file; that's where all the computational content is. The rest is just adding to various enums.

@Ingvarstep
Author

Hello! Thank you very much for your explanation; it answers most of my questions.

Regarding tokenization, I want to build DeBERTa v3, which uses a sentencepiece tokenizer, so I plan to use the implementation from Google. Does llama.cpp have its own implementation of the sentencepiece tokenizer?

@iamlemec
Owner

Yup, llama.cpp has its own sentencepiece tokenizer. Though I'm not sure how well it works with embedding models, which have historically used more of a WordPiece tokenizer. I've been trying to get BAAI/bge-m3 to work for a while, which also uses sentencepiece, and there seem to be some subtle differences messing with the tokenization. That model is based on RoBERTa, so maybe (hopefully) the DeBERTa branch will work better.

Anyway, let me know if you need any help with the DeBERTa implementation!

@Ingvarstep
Author

Got it, thank you for the answer. Considering this, I think it is safer to use the Google implementation for now.

And thank you for offering to help with the DeBERTa implementation; I will get back to you if I have any questions.
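
For anyone who finds this later, the usage I have in mind with Google's sentencepiece C++ library looks roughly like the sketch below (the model path is just a placeholder for the DeBERTa v3 sentencepiece model file):

```cpp
// Rough sketch of tokenizing with Google's sentencepiece C++ library
// (link with -lsentencepiece). The model path is a placeholder.
#include <sentencepiece_processor.h>

#include <cstdio>
#include <vector>

int main() {
    sentencepiece::SentencePieceProcessor sp;
    const auto status = sp.Load("spm.model");  // placeholder path
    if (!status.ok()) {
        fprintf(stderr, "failed to load tokenizer: %s\n", status.ToString().c_str());
        return 1;
    }

    // Encode text into the integer token IDs the model expects.
    std::vector<int> ids;
    sp.Encode("hello world", &ids);
    for (int id : ids) printf("%d ", id);
    printf("\n");
    return 0;
}
```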
