Using llama.cpp #14

Open
PrithivirajDamodaran opened this issue Feb 28, 2024 · 3 comments

@PrithivirajDamodaran

I am trying to use llama.cpp since you suggested support for the same BAAI 1.5 embedding models has been merged there. Could you please help me with how I should get started? I can't figure out the equivalent of the bert_tokenize part there.

Thanks

@iamlemec
Owner

Hi @PrithivirajDamodaran! Sorry I missed your last issue. Forgot to turn on notifications for this repo.

For everyday stuff I just use the llama-cpp-python bindings. Here's an example of how to do embeddings: https://github.com/abetlen/llama-cpp-python#embeddings. You can also just use Llama.embed if you want to get the raw embedding as a list. (Note: there's a typo in the example, it should be embedding=True not embeddings=True)
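
For concreteness, here's a minimal sketch of that flow with llama-cpp-python (the GGUF filename below is just a placeholder for your own conversion of a bge model):

```python
from llama_cpp import Llama

# Load the model with embedding mode enabled
# (note: embedding=True, not embeddings=True).
llm = Llama(model_path="./bge-base-en-v1.5-f16.gguf", embedding=True)

# OpenAI-style response dict; the vector is under data[0]["embedding"]
resp = llm.create_embedding("Hello, world!")

# Or get the raw embedding directly as a list of floats
vec = llm.embed("Hello, world!")
print(len(vec))  # embedding dimension, e.g. 768 for bge-base
```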

As for the MLM issue: right now, llama.cpp doesn't support the MLM head part of the model, so it'll only get you up to the D=768 embeddings. It's possible to turn off pooling for the embeddings and then just fetch the token-level embeddings manually. You can do that in raw llama.cpp, but that option (do_pooling=False) hasn't found its way into llama-cpp-python yet. I'm thinking about making a PR for that today, which should hopefully be merged soon.
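
Purely as a hypothetical sketch of how that might look once the flag lands in the bindings (do_pooling is not in llama-cpp-python today, and the final name or return shape may differ):

```python
from llama_cpp import Llama

# Hypothetical: assumes the pending PR exposes the same do_pooling flag
# that raw llama.cpp uses; this does not run against current releases.
llm = Llama(model_path="./bge-base-en-v1.5-f16.gguf",
            embedding=True,
            do_pooling=False)  # hypothetical flag, not yet in the bindings

# With pooling off, embed would return one D=768 vector per input token
# instead of a single pooled sentence vector.
token_vectors = llm.embed("a short test sentence")
```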

@PrithivirajDamodaran
Author

PrithivirajDamodaran commented Feb 29, 2024

Hey @iamlemec, thanks for taking the time and for all the awesome work you are doing.

  • I was interested to know about the llama.cpp BERT merge because you mentioned in the defunct notice that "it's way faster". I will look through the code, but it would be easier to know what the optimisations were :)

  • Thanks for adding the pooling flag; it will be useful when we need access to the raw embeddings. But the MLM head is just a combination of GeLU and linear layers (a rough sketch follows this list). I am not fully acquainted with the ggml APIs, but I will see how best I can add that from my side as well.

  • Also, I am looking for bare-metal performance, so Python bindings currently aren't on my radar.
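
For reference, a rough PyTorch sketch of that MLM head (the exact weight names in ggml will differ; hidden and vocab sizes assume bge-base with the standard BERT vocabulary):

```python
import torch.nn as nn

# BERT-style MLM head: dense layer, GeLU, LayerNorm, then a projection
# back to the vocabulary, applied on top of the D=768 token embeddings.
class MLMHead(nn.Module):
    def __init__(self, hidden_size=768, vocab_size=30522):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_embeddings):         # [batch, seq_len, hidden]
        x = self.norm(self.act(self.dense(token_embeddings)))
        return self.decoder(x)                   # [batch, seq_len, vocab] logits
```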

As we speak, I am working on a fork so the community can take full advantage of all the awesome work that has been done in this space 🙏. Will share more soon.

Cheers
Prithivi

@iamlemec
Owner

iamlemec commented Mar 2, 2024

@PrithivirajDamodaran Looks cool! Yeah, so I haven't done benchmarks in a bit, but the main reason it should be faster is that llama.cpp packs different sequences together in a single batch, while here we pad the sequences to the same length and make a big square batch. This is pretty inefficient when sequences come in with widely varying lengths. There are probably some other reasons, but I think that's the main one.
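
A toy illustration of the overhead (not llama.cpp code; the token counts are made up): a padded batch processes max_len * n_seqs token slots, while a packed batch only processes sum(lens).

```python
# Hypothetical per-sequence token counts with widely varying lengths
lens = [12, 37, 512, 8]

padded_slots = max(lens) * len(lens)  # square batch padded to the longest sequence
packed_slots = sum(lens)              # sequences packed back to back in one batch

print(padded_slots, packed_slots)     # 2048 vs 569
print(f"padding overhead: {1 - packed_slots / padded_slots:.0%} wasted slots")
```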

Yup, pooling options are great, especially with some of the new approaches coming out like GritLM.
