
Is bert_encode() thread-safe for online embedding? #11

Open
WayneCao opened this issue Feb 18, 2024 · 5 comments

Comments

@WayneCao

No description provided.

@WayneCao
Author

I found that different invocations share the same memory buffer in bert_context, so it may not be thread-safe in an online-embedding setting.
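
One minimal mitigation is to serialize calls into the shared context with a mutex. A sketch, assuming a bert_encode signature along these lines (check bert.h for the real one):

```cpp
#include <mutex>

struct bert_ctx;  // opaque context from bert.cpp

// Assumed signature for illustration; the actual declaration is in bert.h.
void bert_encode(bert_ctx * ctx, int n_threads, const char * text, float * embeddings);

static std::mutex g_ctx_mutex;

// Serialize all callers onto the shared context, since concurrent
// invocations would otherwise race on its internal scratch buffers.
void bert_encode_locked(bert_ctx * ctx, int n_threads, const char * text, float * embeddings) {
    std::lock_guard<std::mutex> lock(g_ctx_mutex);
    bert_encode(ctx, n_threads, text, embeddings);
}
```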

@iamlemec
Owner

Yup, that seems right. Good news is that we got merged into llama.cpp, which has multi-threading support. Check it out over there!

@WayneCao
Author

Can you help explain the implementation mechanism?

@iamlemec
Owner

Sure! The major difference from this one is the way that batching works. Here we have explicit batch sizes for each sequence, so we need to pad them to alignment. In the llama.cpp implementation, batches are essentially lists of (sequence_id, position, token_id) tuples, so you can put multiple sequences in one batch without padding, which can be a big win when sequence lengths are uneven. The bulk of the new code there is in llama.cpp:build_bert() if you want to go into more detail.
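
To make the layout concrete, here is a rough sketch of packing two uneven-length sequences into one batch (llama_batch field names as of early 2024; check llama.h for your version):

```cpp
#include "llama.h"

// Append one sequence's tokens to a batch. Each slot carries its own
// (token, position, sequence id), so no padding is needed between sequences.
static void add_sequence(llama_batch & batch, const llama_token * toks, int n, llama_seq_id seq) {
    for (int i = 0; i < n; i++) {
        const int j = batch.n_tokens++;
        batch.token   [j]    = toks[i];
        batch.pos     [j]    = i;     // position within this sequence
        batch.n_seq_id[j]    = 1;
        batch.seq_id  [j][0] = seq;   // which sequence this token belongs to
        batch.logits  [j]    = false; // per-token logits not needed for embeddings
    }
}

// Usage: two sequences of different lengths share one batch.
//   llama_batch batch = llama_batch_init(512, 0, 2);
//   add_sequence(batch, toks_a,  7, 0);
//   add_sequence(batch, toks_b, 23, 1);
//   llama_decode(ctx, batch);
//   llama_batch_free(batch);
```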

Is that what you were looking for? Happy to provide more specifics.

@WayneCao
Author

Thank you so much! It seems this only supports multi-threading within a single batch?
Let me briefly state my question: I want to wrap llama.cpp in an online embedding service. When concurrent client requests come in, llama.cpp:build_bert() does not appear to be thread-safe across invocations. I haven't figured out how to guarantee memory safety in llama_context across different invocations, and I couldn't find any read-write lock around build_bert.
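
For reference, there is no lock inside llama.cpp itself: a llama_context is not safe to share across threads, but several contexts can be created from one shared llama_model. A common pattern for a service is one context per worker (or a mutex around a single context). A sketch, with API names as of early 2024 and hedged where they drift between versions:

```cpp
#include "llama.h"
#include <thread>
#include <vector>

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);

    const int n_workers = 4;
    std::vector<std::thread> workers;
    for (int w = 0; w < n_workers; w++) {
        workers.emplace_back([model]() {
            // Each worker owns its own context; only the model is shared.
            llama_context_params cparams = llama_context_default_params();
            cparams.embeddings = true;  // enable embedding output ('embedding' in older versions)
            llama_context * ctx = llama_new_context_with_model(model, cparams);
            // ... per-worker loop: pop a request, tokenize, fill a llama_batch,
            // llama_decode(ctx, batch), then read the pooled embedding
            // (e.g. llama_get_embeddings_seq(ctx, seq_id) in recent versions) ...
            llama_free(ctx);
        });
    }
    for (auto & t : workers) t.join();

    llama_free_model(model);
    return 0;
}
```

The weights in llama_model are read-only (and typically mmapped), so sharing them across workers is cheap; the mutable compute buffers live in each llama_context, which is what removes the race.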
