semchunk vs text-splitter #507
Can you send me a copy of the code you are running? There is no reason you can't run this in parallel if it is for multiple documents, but I could document how. It's pretty trivial in Rust, but maybe the Python bindings could use a nicer interface. But it would help to see how you are calling it. Thanks!
I'm using this code for benchmarking: bench.py. Additional documentation for the Python bindings would be super helpful! In my use case, I plan to periodically create chunks from batches of files as and when they're updated, which will run in a separate thread from the main application. So, parallelization would be really good if possible.
Thanks @tensorsofthewall. It does indeed seem like it's not an apples-to-apples comparison, as it is running one serially and one in parallel. It could be made better by using a threadpool, but maybe that is too advanced for most users. I'll do some testing and either document an example with a threadpool, or pull the threadpool in on the Rust side of the bindings. You aren't the first to ask for it, and I suspect the benchmarks will get much closer in this example. But we'll see.
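A minimal sketch of what that threadpool approach could look like on the Python side, assuming the `TextSplitter(capacity)` constructor and `chunks` method from recent versions of the bindings (this is an illustration, not the repo's actual bench.py):

```python
from concurrent.futures import ThreadPoolExecutor

from semantic_text_splitter import TextSplitter

# One splitter shared by all threads; the 512-character capacity is arbitrary.
splitter = TextSplitter(512)

def chunk_all(texts: list[str]) -> list[list[str]]:
    # Threads only pay off if the Rust extension releases the GIL while it
    # chunks; otherwise the map degrades back to roughly serial execution.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(splitter.chunks, texts))
```

Whether this beats a plain loop depends on the GIL question raised below.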
Thanks @benbrandt! Either approach would work; of course, making changes within the Rust code would be better long-term. Please let me know when you've been able to test either approach :)
@tensorsofthewall I did a test here: https://github.com/benbrandt/semchunk/blob/parallel-benchmark/tests/bench.py#L26-L28 It cut the time down by 30% or so, but it's still slower. I will test it out with a threadpool in Rust over the weekend. I suspect the GIL might be playing a role here, and we don't need to worry about that on the Rust side. I'll keep you posted.
@tensorsofthewall ok, I moved it to the Rust side, but it still takes a decent amount longer. Previous (on my machine):
With threads:
So a 33% reduction, which would likely only bring yours down to 30s or so. I have a feeling it is because I am using Unicode libraries to determine certain semantic levels, while semchunk is only handling a smaller set of punctuation and whitespace. The Unicode handling should be resilient to more languages and characters, but whether that difference matters likely depends on your dataset. So, as usual, it is tradeoffs. If you like the chunk quality of semchunk better, and it is faster, win-win. But if you like the chunk quality of semantic-text-splitter better, then it comes down to whether the time difference matters to you. It also depends on how many chunks at a time and how large. I do know from another issue that semantic-text-splitter performs better than semchunk on HUGE documents (umarbutler/semchunk#8), so again... it comes down to your dataset and what you need. I will still attempt to push the thread feature out, since it is still a significant improvement and the threadpool implementation on the Python side isn't very ergonomic...
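To make that tradeoff concrete, here is a toy illustration (not either library's actual code) of why a fixed punctuation set is cheaper but misses breaks that Unicode-aware handling catches:

```python
import re

text = "これはペンです。That is all. 了解！"

# Splitting only on ASCII punctuation plus whitespace, roughly the cheaper
# approach: the CJK full stop 。 and fullwidth ！ are never treated as breaks.
naive = re.split(r"(?<=[.!?])\s+", text)

# Widening the class to include fullwidth terminators recovers those breaks,
# at the cost of tracking more of Unicode.
unicode_aware = re.split(r"(?<=[.!?。！？])\s*", text)

print(naive)          # ['これはペンです。That is all.', '了解！']
print(unicode_aware)  # ['これはペンです。', 'That is all.', '了解！', '']
```

The crates involved reportedly use real Unicode segmentation libraries rather than regexes, but the cost/coverage tradeoff has the same shape.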
Sounds good! For my use case, semantic-text-splitter is worth the performance tradeoff :) Do let me know when you release an updated version. Thanks!
Ok @tensorsofthewall, https://github.com/benbrandt/text-splitter/releases/tag/v0.19.1 is out! Let me know if it helps at all. I will also look into swapping out the Unicode parsing, as there is another library I can try to see if it is any faster. I also realized GitHub released a new tokenization crate optimized for chunking that might help. Which models are you using? Just curious if it would help at all, as it doesn't support all models.
Teeny tiny correction: the document in question was HUGE and had no newlines. But yeah, semchunk will do better for some inputs and worse for others, though I’m also looking out for new improvements. |
oh that's right, the lack of newlines definitely made the binary search space huge. I adjusted it to do a search-space expansion before binary searching, which increased latency for certain inputs, but provided an upper bound for documents with large numbers of items at the same semantic level.
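A minimal sketch of that expansion idea (illustrative only, not the crate's code), where `fits(n)` is a hypothetical monotone predicate reporting whether the first n items fit in one chunk:

```python
def find_max_fit(fits, n_items: int) -> int:
    """Largest k in [0, n_items] with fits(k) True, assuming fits is
    monotone: True up to some point, then False afterwards."""
    # Expansion phase: double the probe until it no longer fits or we run
    # out of items, which brackets the answer near its actual position...
    bound = 1
    while bound <= n_items and fits(bound):
        bound *= 2
    lo, hi = bound // 2, min(bound, n_items)
    # ...then binary search only inside that bracket, rather than over the
    # whole run of same-level items.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

The doubling costs a few extra probes when the answer is small (the added latency mentioned above), but it caps the search range when one semantic level contains a huge number of items.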
Thanks, I'll try this out!
I'm currently using gte-base-v1.5, although I'm open to using other embedding models. |
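For reference, a sketch of sizing chunks with that model's tokenizer via the Hugging Face `tokenizers` package; the Hub repo id and the 512-token capacity are assumptions, and `from_huggingface_tokenizer` is the constructor the Python bindings document for this case:

```python
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# Assumed Hub id for gte-base-v1.5; adjust to wherever the model is published.
tokenizer = Tokenizer.from_pretrained("Alibaba-NLP/gte-base-en-v1.5")

# Chunk capacity is measured in that tokenizer's tokens; 512 is arbitrary.
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, 512)

chunks = splitter.chunks("Some long document text...")
```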
Ah ok, good to know. The new one just supports some OpenAI models out of the box, but I can see how hard it would be to support models from the tokenizers package. Thanks!
I'm currently in the process of integrating a semantic chunker into my RAG application, but I'm having some trouble understanding the reason for the performance / functionality differences between semchunk and text-splitter. I don't know Rust well, so I apologize if I'm asking a silly question.
Could you tell me what the fundamental difference is between semchunk and text-splitter in terms of functionality? As for the performance difference, I'm assuming part of semchunk's speed comes from its parallelization (mpire). I see that one of the tasks on the project roadmap is to implement parallelization. I benchmarked the two using semchunk's provided benchmark code: semchunk took 5.2 seconds to chunk the Gutenberg corpus, compared to semantic-text-splitter's 50.2-second runtime.
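For context, an apples-to-apples serial version of that comparison might look like the sketch below; the `semchunk.chunkerify` call, the shared tiktoken model, and the corpus stand-in are all assumptions, not the published benchmark code:

```python
import time

import semchunk
from semantic_text_splitter import TextSplitter

texts = ["Some document text..."] * 100  # stand-in for the Gutenberg corpus

# Size both chunkers with the same tiktoken model so the chunk budgets match.
chunker = semchunk.chunkerify("gpt-4", chunk_size=512)
splitter = TextSplitter.from_tiktoken_model("gpt-4", 512)

def bench(label, chunk_fn):
    start = time.perf_counter()
    for text in texts:
        chunk_fn(text)  # one document at a time, no multiprocessing for either
    print(f"{label}: {time.perf_counter() - start:.2f}s")

bench("semchunk", chunker)
bench("semantic-text-splitter", splitter.chunks)
```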