semchunk vs text-splitter #507

Open
tensorsofthewall opened this issue Dec 13, 2024 · 12 comments

@tensorsofthewall

tensorsofthewall commented Dec 13, 2024

I'm currently in the process of integrating a semantic chunker into my RAG application, but I'm having some trouble understanding the reason for the performance and functionality differences between semchunk and text-splitter. I don't know Rust well, so I apologize if I'm asking a silly question.

Could you tell me what the fundamental difference is between semchunk and text-splitter in terms of functionality? As for performance, I'm assuming part of semchunk's speed comes from its parallelization (via mpire). I see that one of the tasks on the project roadmap is to implement parallelization. I benchmarked the two using semchunk's provided benchmark code: semchunk took 5.2 seconds to chunk the Gutenberg corpus, compared to semantic-text-splitter's 50.2-second runtime.
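
For reference, the comparison has roughly this shape (a minimal sketch, not semchunk's actual bench.py; the corpus stand-in and the character-based sizing are assumptions):

```python
# A rough sketch of the serial head-to-head, assuming semchunk's
# chunkerify() API and semantic_text_splitter's Python bindings.
# Character-based sizing is used for both so the capacities match;
# the real bench.py may size by tokens instead.
import time

import semchunk
from semantic_text_splitter import TextSplitter

texts = ["first document...", "second document..."]  # stand-in for the Gutenberg corpus
CAPACITY = 512

chunker = semchunk.chunkerify(len, chunk_size=CAPACITY)  # len() counts characters
start = time.perf_counter()
for text in texts:
    chunker(text)
print(f"semchunk: {time.perf_counter() - start:.1f}s")

splitter = TextSplitter(CAPACITY)  # capacity in characters
start = time.perf_counter()
for text in texts:
    splitter.chunks(text)
print(f"semantic_text_splitter: {time.perf_counter() - start:.1f}s")
```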

@benbrandt
Owner

Can you send me a copy of the code you are running? There is no reason you can't run this in parallel if it is for multiple documents, but I could document how. It's pretty trivial in Rust, but maybe the Python bindings could use a nicer interface. But it would help to see how you are calling it.

Thanks!

@tensorsofthewall
Author

I'm using this code for benchmarking: bench.py. Additional documentation for the Python bindings would be super helpful! In my use case, I plan to periodically create chunks from batches of files as and when they're updated, which will run in a separate thread from the main application. So, parallelization would be really good if possible.
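
For a batch use case like this, one way to parallelize from Python today is a process pool across documents. A sketch under stated assumptions: `TextSplitter(capacity)` and `.chunks()` are the published binding API, while the corpus directory and capacity value are hypothetical.

```python
# A sketch of fanning chunking out across documents with a process pool,
# which sidesteps the GIL. TextSplitter(capacity) and .chunks() are the
# published Python-binding API; the corpus directory is hypothetical.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from semantic_text_splitter import TextSplitter

CAPACITY = 1000  # maximum characters per chunk

def chunk_file(path: str) -> list[str]:
    # Each worker builds its own splitter so nothing is shared across processes.
    splitter = TextSplitter(CAPACITY)
    return splitter.chunks(Path(path).read_text())

if __name__ == "__main__":
    paths = [str(p) for p in Path("corpus").glob("*.txt")]  # hypothetical layout
    with ProcessPoolExecutor() as pool:
        chunks_per_file = list(pool.map(chunk_file, paths))
```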

@benbrandt
Owner

Thanks @tensorsofthewall. It does indeed seem like it's not an apples-to-apples comparison, as it is running one serially and one in parallel. It could be made fairer by using a threadpool, but maybe that is too advanced for most users.

I'll either do some testing with a threadpool and document an example, or pull the threadpool in on the Rust side of the bindings. You aren't the first to ask for it, and I suspect the benchmarks will get much closer in this example. But we'll see.

@tensorsofthewall
Author

Thanks @benbrandt! Either approach would work; of course, making changes within the Rust code would be better long-term. Please let me know when you've been able to test either approach :)

@benbrandt
Owner

benbrandt commented Dec 13, 2024

@tensorsofthewall I did a test here: https://github.com/benbrandt/semchunk/blob/parallel-benchmark/tests/bench.py#L26-L28

It cut the time down by 30% or so, but it's still slower. I will test it out with a threadpool in Rust over the weekend. I suspect the GIL might be playing a role here, and we won't need to worry about that for Rust-side code. I'll keep you posted.
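
The linked test isn't reproduced here, but a Python-side threadpool variant presumably has roughly this shape (an assumption; see the linked bench.py for the actual code). Threads only pay off if the underlying Rust call releases the GIL:

```python
# Roughly the shape of a Python-side threadpool variant (an assumption;
# see the linked bench.py for the real code). With the GIL held during
# the Rust call, threads may add little until it is released internally.
from concurrent.futures import ThreadPoolExecutor

from semantic_text_splitter import TextSplitter

splitter = TextSplitter(1000)  # capacity in characters
texts = ["first document...", "second document..."]  # the benchmark corpus

with ThreadPoolExecutor() as pool:
    all_chunks = list(pool.map(splitter.chunks, texts))
```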

@benbrandt
Owner

@tensorsofthewall ok, I moved it to the Rust side, but it still takes a decent amount longer.

Previous (on my machine):

semchunk: 2.47s
semantic_text_splitter: 28.65s

With threads:

semchunk: 2.50s
semantic_text_splitter: 18.94s

So that's a 33% reduction, which would likely only bring yours down to 30s or so.

I have a feeling it is because I am using Unicode libraries to determine certain semantic levels, while semchunk only checks a smaller set of punctuation and whitespace. The Unicode handling should be resilient to more languages and characters, but whether that difference matters will likely depend on your dataset.

So, as usual, it is tradeoffs. If you like the chunk quality of semchunk better, and it is faster, win-win. But if you like the chunk quality of semantic-text-splitter better, then it comes down to whether the time difference matters to you.

It also depends on how many chunks at a time and how large they are. I do know from another issue a user filed that semantic-text-splitter performs better than semchunk on HUGE documents (umarbutler/semchunk#8), so again... it comes down to your dataset and what you need.

I will still attempt to push the thread feature out, since it is still a significant improvement and the threadpool implementation on the Python side isn't very ergonomic...

@tensorsofthewall
Author

Sounds good! For my use case, semantic-text-splitter is worth the performance tradeoff :) Do let me know when you release an updated version. Thanks!

@benbrandt
Owner

Ok @tensorsofthewall https://github.com/benbrandt/text-splitter/releases/tag/v0.19.1 is out! Let me know if it helps at all.

I will also look into swapping out the unicode parsing, as there is another library I can try to see if it is any faster.

Also, I realized GitHub released a new tokenization crate optimized for chunking that might help. Which models are you using? Just curious if it would help at all, as it doesn't support all models.

@umarbutler

> It also depends on how many chunks at a time and how large they are. I do know from another issue a user filed that semantic-text-splitter performs better than semchunk on HUGE documents (umarbutler/semchunk#8), so again... it comes down to your dataset and what you need.

Teeny tiny correction: the document in question was HUGE and had no newlines. But yeah, semchunk will do better for some inputs and worse for others, though I’m also looking out for new improvements.

@benbrandt
Owner

Oh that's right, the lack of newlines definitely made the binary search space huge.

I adjusted it to do a search space expansion before binary searching, which increased latency for certain inputs, but provided an upper bound for documents with large numbers of items at the same semantic level.
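
That "expand, then binary search" step reads like a galloping search. A minimal sketch of the idea, as an illustration only rather than the crate's actual Rust implementation (the `fits` predicate is hypothetical):

```python
# A minimal sketch of "expand, then binary search" (galloping search);
# an illustration of the idea only, not text-splitter's actual Rust code.
# fits(n) is a hypothetical monotonic predicate: True while the first n
# items at a semantic level still fit within the chunk capacity.
def largest_fit(fits, total: int) -> int:
    if total == 0 or not fits(1):
        return 0
    # Expansion phase: double an upper bound until it stops fitting, so
    # the binary search below stays bounded even when `total` is huge
    # (e.g. one enormous document with no newlines).
    lo = 1
    while lo < total and fits(min(lo * 2, total)):
        lo = min(lo * 2, total)
    if lo == total:
        return total
    # Binary search between lo and 2 * lo for the largest n that still fits.
    hi = min(lo * 2, total)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

The expansion costs a few extra probes up front but caps the binary search window at twice the last fitting size, instead of letting it span the whole document.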

@tensorsofthewall
Author

> Ok @tensorsofthewall https://github.com/benbrandt/text-splitter/releases/tag/v0.19.1 is out! Let me know if it helps at all.

Thanks, I'll try this out!

> Also, I realized GitHub released a new tokenization crate optimized for chunking that might help. Which models are you using? Just curious if it would help at all, as it doesn't support all models.

I'm currently using gte-base-v1.5, although I'm open to using other embedding models.
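
For token-accurate sizing against that model, the splitter can presumably be driven by the model's own tokenizer. A sketch where `from_huggingface_tokenizer` is the published binding constructor, and the model id and 512-token capacity are assumptions:

```python
# Sizing chunks in tokens using the embedding model's own tokenizer.
# from_huggingface_tokenizer is part of the published Python bindings;
# the model id and the 512-token capacity are assumptions.
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("Alibaba-NLP/gte-base-en-v1.5")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, 512)

chunks = splitter.chunks("your document text here")
```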

@benbrandt
Owner

Ah ok, good to know. The new one just supports some OpenAI ones out of the box, but I can see how hard it would be to support tokenizers package ones. Thanks!
