Problems with using start_multi_process_pool() #2955

Open
safwaqf opened this issue Sep 24, 2024 · 1 comment

safwaqf commented Sep 24, 2024

When I start the process pool with start_multi_process_pool() and then use it from my own Python threads, why does the length of the sentence list not match the length of the embedding list?
For example:
batchNum:1 queLen: 100, embLen: 98
batchNum:2 queLen: 100, embLen: 102
batchNum:3 queLen: 100, embLen: 102
batchNum:4 queLen: 100, embLen: 98
Here I print the sentence list length (queLen) and the embedding list length (embLen) for four batches. The first batch came back with 2 embeddings too few, and those 2 appear to have ended up in the second batch. Similarly, the third batch came back with 2 extra embeddings, and the fourth batch is 2 short.

@tomaarsen
Collaborator

Hello!

Do you start the Python multithreading yourself? That shouldn't be needed.
There's normally just 1 queue, and each process will continuously pop from that shared queue until it's empty. These processes will then also push to 1 shared output queue. This queue is sorted afterwards to ensure that we have the same order as the inputs, but we still have just 1 output queue.
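
To illustrate why calling the pool from several threads at once could give counts like 98 and 102, here is a minimal, single-threaded simulation of two callers sharing one input queue and one output queue. It is only a sketch of the idea, not the library's actual code: the helper names, the chunk size of 3, and the perfectly alternating submission order are all assumptions chosen for illustration.

```python
from queue import Queue

CHUNK_SIZE = 3  # assumed chunk size, chosen only for illustration


def make_chunks(caller, sentences):
    """Split one caller's sentences into (caller, chunk_id, chunk) tuples."""
    starts = range(0, len(sentences), CHUNK_SIZE)
    return [(caller, i, sentences[s:s + CHUNK_SIZE]) for i, s in enumerate(starts)]


def run_worker(input_queue, output_queue):
    """Stand-in for the worker processes: pop chunks until the queue is empty."""
    while not input_queue.empty():
        caller, chunk_id, chunk = input_queue.get()
        fake_embeddings = [len(s) for s in chunk]  # placeholder for real vectors
        output_queue.put((caller, chunk_id, fake_embeddings))


def collect(output_queue, n_chunks):
    """Read n_chunks results, sort by chunk id, and flatten them."""
    results = sorted((output_queue.get() for _ in range(n_chunks)), key=lambda r: r[1])
    return [emb for _, _, embs in results for emb in embs]


def simulate():
    input_queue, output_queue = Queue(), Queue()
    chunks_a = make_chunks("A", [f"a{i}" for i in range(100)])
    chunks_b = make_chunks("B", [f"b{i}" for i in range(100)])

    # Two callers submitting at the same time: their chunks end up interleaved.
    for chunk_a, chunk_b in zip(chunks_a, chunks_b):
        input_queue.put(chunk_a)
        input_queue.put(chunk_b)

    run_worker(input_queue, output_queue)

    # Each caller reads "its" number of chunks, but the queue cannot tell
    # the callers apart, so each one also picks up the other's results.
    emb_a = collect(output_queue, len(chunks_a))
    emb_b = collect(output_queue, len(chunks_b))
    return len(emb_a), len(emb_b)


print(simulate())  # (102, 98)
```

In this model each caller has 33 chunks of 3 sentences plus one chunk of 1. Whichever caller reads two full chunks in place of the short ones ends up with 102 results, and the other with 98, even though the total is still 200.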

So, the usage is:

from sentence_transformers import SentenceTransformer

def main():
    model = SentenceTransformer("all-mpnet-base-v2")
    sentences = ["The weather is so nice!", "It's so sunny outside.", "He's driving to the movie theater.", "She's going to the cinema."] * 1000

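    # Start the worker processes, encode everything in a single call,
    # then shut the workers down again.
    pool = model.start_multi_process_pool()
    embeddings = model.encode_multi_process(sentences, pool)
    model.stop_multi_process_pool(pool)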

    print(embeddings.shape)
    # => (4000, 768)

if __name__ == "__main__":
    main()

https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html?highlight=multi_process#sentence_transformers.SentenceTransformer.encode_multi_process
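
If you have several batches, one option is to reuse a single pool and encode the batches one after another from a single thread, so that only one call is using the queues at any time. The sketch below assumes that approach; encode_batches is a hypothetical helper name, and it only relies on the three methods shown in the usage above.

```python
def encode_batches(model, batches):
    """Encode each batch in turn with one shared pool, from a single thread.

    encode_batches is a hypothetical helper, not part of sentence-transformers.
    """
    pool = model.start_multi_process_pool()
    try:
        results = []
        for batch in batches:
            embeddings = model.encode_multi_process(batch, pool)
            # Each call finishes before the next one starts, so the number of
            # embeddings should equal the number of sentences in the batch.
            assert len(embeddings) == len(batch)
            results.append(embeddings)
        return results
    finally:
        # Always shut the worker processes down, even if encoding fails.
        model.stop_multi_process_pool(pool)
```

With a real model this would be called as encode_batches(SentenceTransformer("all-mpnet-base-v2"), batches), inside the same if __name__ == "__main__": guard as above.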

  • Tom Aarsen
