semchunk vs text-splitter #507

Open
tensorsofthewall opened this issue Dec 13, 2024 · 12 comments

@tensorsofthewall

tensorsofthewall commented Dec 13, 2024

I'm currently in the process of integrating a semantic chunker into my RAG application, but I'm having some trouble understanding the reason for the performance and functionality differences between semchunk and text-splitter. I don't know Rust well, so I apologize if I'm asking a silly question.

Could you tell me what the fundamental difference is between semchunk and text-splitter in terms of functionality? As for performance, I'm assuming part of semchunk's speed comes from its parallelization (via mpire). I see that one of the tasks on the project roadmap is to implement parallelization. I benchmarked the two using semchunk's provided benchmark code: semchunk took 5.2 seconds to chunk the Gutenberg corpus, compared to semantic-text-splitter's 50.2-second runtime.
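
For reference, the comparison has roughly this shape (a minimal sketch, not semchunk's actual bench.py; the corpus stand-in and the character-based sizing are assumptions):

```python
# A rough sketch of the serial head-to-head, assuming semchunk's
# chunkerify() API and semantic_text_splitter's Python bindings.
# Character-based sizing is used for both so the capacities match;
# the real bench.py may size by tokens instead.
import time

import semchunk
from semantic_text_splitter import TextSplitter

texts = ["first document...", "second document..."]  # stand-in for the Gutenberg corpus
CAPACITY = 512

chunker = semchunk.chunkerify(len, chunk_size=CAPACITY)  # len() counts characters
start = time.perf_counter()
for text in texts:
    chunker(text)
print(f"semchunk: {time.perf_counter() - start:.1f}s")

splitter = TextSplitter(CAPACITY)  # capacity in characters
start = time.perf_counter()
for text in texts:
    splitter.chunks(text)
print(f"semantic_text_splitter: {time.perf_counter() - start:.1f}s")
```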

@benbrandt
Owner

Can you send me a copy of the code you are running? There is no reason you can't run this in parallel if it is for multiple documents, but I could document how. It's pretty trivial in Rust, but maybe the Python bindings could use a nicer interface. But it would help to see how you are calling it.

Thanks!

@tensorsofthewall
Author

I'm using this code for benchmarking: bench.py. Additional documentation for the Python bindings would be super helpful! In my use case, I plan to periodically create chunks from batches of files as and when they're updated, which will run in a separate thread from the main application. So, parallelization would be really good if possible.
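
For a batch use case like this, one way to parallelize from Python today is a process pool across documents. A sketch under stated assumptions: `TextSplitter(capacity)` and `.chunks()` are the published binding API, while the corpus directory and capacity value are hypothetical.

```python
# A sketch of fanning chunking out across documents with a process pool,
# which sidesteps the GIL. TextSplitter(capacity) and .chunks() are the
# published Python-binding API; the corpus directory is hypothetical.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from semantic_text_splitter import TextSplitter

CAPACITY = 1000  # maximum characters per chunk

def chunk_file(path: str) -> list[str]:
    # Each worker builds its own splitter so nothing is shared across processes.
    splitter = TextSplitter(CAPACITY)
    return splitter.chunks(Path(path).read_text())

if __name__ == "__main__":
    paths = [str(p) for p in Path("corpus").glob("*.txt")]  # hypothetical layout
    with ProcessPoolExecutor() as pool:
        chunks_per_file = list(pool.map(chunk_file, paths))
```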

@benbrandt
Owner

Thanks @tensorsofthewall. It does indeed seem like it's not an apples-to-apples comparison, as it is running one serially and one in parallel. It could be made fairer by using a threadpool, but maybe that is too advanced for most users.

I'll either do some testing with a threadpool and document an example, or pull the threadpool in on the Rust side of the bindings. You aren't the first to ask for it, and I suspect the benchmarks will get much closer in this example. But we'll see.

@tensorsofthewall
Author

Thanks @benbrandt! Either approach would work; of course, making changes within the Rust code would be better long-term. Please let me know when you've been able to test either approach :)

@benbrandt
Owner

benbrandt commented Dec 13, 2024

@tensorsofthewall I did a test here: https://github.com/benbrandt/semchunk/blob/parallel-benchmark/tests/bench.py#L26-L28

It cut the time down by 30% or so, but it's still slower. I will test it out with a threadpool in Rust over the weekend. I suspect the GIL might be playing a role here, and we won't need to worry about that for Rust-side code. I'll keep you posted.
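
The linked test isn't reproduced here, but a Python-side threadpool variant presumably has roughly this shape (an assumption; see the linked bench.py for the actual code). Threads only pay off if the underlying Rust call releases the GIL:

```python
# Roughly the shape of a Python-side threadpool variant (an assumption;
# see the linked bench.py for the real code). With the GIL held during
# the Rust call, threads may add little until it is released internally.
from concurrent.futures import ThreadPoolExecutor

from semantic_text_splitter import TextSplitter

splitter = TextSplitter(1000)  # capacity in characters
texts = ["first document...", "second document..."]  # the benchmark corpus

with ThreadPoolExecutor() as pool:
    all_chunks = list(pool.map(splitter.chunks, texts))
```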

@benbrandt
Owner

@tensorsofthewall ok, I moved it to the Rust side, but it still takes a decent amount longer.

Previous (on my machine):

semchunk: 2.47s
semantic_text_splitter: 28.65s

With threads:

semchunk: 2.50s
semantic_text_splitter: 18.94s

So that's a 33% reduction, which would likely only bring yours down to 30s or so.

I have a feeling it is because I am using Unicode libraries to determine certain semantic levels, while semchunk only checks a smaller set of punctuation and whitespace. The Unicode handling should be resilient to more languages and characters, but whether that difference matters will likely depend on your dataset.

So, as usual, it is tradeoffs. If you like the chunk quality of semchunk better, and it is faster, win-win. But if you like the chunk quality of semantic-text-splitter better, then it comes down to whether the time difference matters to you.

It also depends on how many chunks at a time and how large they are. I do know from another issue a user filed that semantic-text-splitter performs better than semchunk on HUGE documents (umarbutler/semchunk#8), so again... it comes down to your dataset and what you need.

I will still attempt to push the thread feature out, since it is still a significant improvement and the threadpool implementation on the Python side isn't very ergonomic...

@tensorsofthewall
Author

Sounds good! For my use case, semantic-text-splitter is worth the performance tradeoff :) Do let me know when you release an updated version. Thanks!

@benbrandt
Owner

Ok @tensorsofthewall https://github.com/benbrandt/text-splitter/releases/tag/v0.19.1 is out! Let me know if it helps at all.

I will also look into swapping out the unicode parsing, as there is another library I can try to see if it is any faster.

Also, I realized GitHub released a new tokenization crate optimized for chunking that might help. Which models are you using? Just curious if it would help at all, as it doesn't support all models.

@umarbutler

> It also depends on how many chunks at a time and how large they are. I do know from another issue a user filed that semantic-text-splitter performs better than semchunk on HUGE documents (umarbutler/semchunk#8), so again... it comes down to your dataset and what you need.

Teeny tiny correction: the document in question was HUGE and had no newlines. But yeah, semchunk will do better for some inputs and worse for others, though I’m also looking out for new improvements.

@benbrandt
Owner

Oh that's right, the lack of newlines definitely made the binary search space huge.

I adjusted it to do a search space expansion before binary searching, which increased latency for certain inputs, but provided an upper bound for documents with large numbers of items at the same semantic level.
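
That "expand, then binary search" step reads like a galloping search. A minimal sketch of the idea, as an illustration only rather than the crate's actual Rust implementation (the `fits` predicate is hypothetical):

```python
# A minimal sketch of "expand, then binary search" (galloping search);
# an illustration of the idea only, not text-splitter's actual Rust code.
# fits(n) is a hypothetical monotonic predicate: True while the first n
# items at a semantic level still fit within the chunk capacity.
def largest_fit(fits, total: int) -> int:
    if total == 0 or not fits(1):
        return 0
    # Expansion phase: double an upper bound until it stops fitting, so
    # the binary search below stays bounded even when `total` is huge
    # (e.g. one enormous document with no newlines).
    lo = 1
    while lo < total and fits(min(lo * 2, total)):
        lo = min(lo * 2, total)
    if lo == total:
        return total
    # Binary search between lo and 2 * lo for the largest n that still fits.
    hi = min(lo * 2, total)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

The expansion costs a few extra probes up front but caps the binary search window at twice the last fitting size, instead of letting it span the whole document.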

@tensorsofthewall
Author

> Ok @tensorsofthewall https://github.com/benbrandt/text-splitter/releases/tag/v0.19.1 is out! Let me know if it helps at all.

Thanks, I'll try this out!

> Also, I realized GitHub released a new tokenization crate optimized for chunking that might help. Which models are you using? Just curious if it would help at all, as it doesn't support all models.

I'm currently using gte-base-v1.5, although I'm open to using other embedding models.
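
For token-accurate sizing against that model, the splitter can presumably be driven by the model's own tokenizer. A sketch where `from_huggingface_tokenizer` is the published binding constructor, and the model id and 512-token capacity are assumptions:

```python
# Sizing chunks in tokens using the embedding model's own tokenizer.
# from_huggingface_tokenizer is part of the published Python bindings;
# the model id and the 512-token capacity are assumptions.
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("Alibaba-NLP/gte-base-en-v1.5")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, 512)

chunks = splitter.chunks("your document text here")
```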

@benbrandt
Owner

Ah ok, good to know. The new one just supports some OpenAI ones out of the box, but I can see how hard it would be to support tokenizers package ones. Thanks!
