
[FEAT] Support Hierarchical Chunking with Semantic Chunking as a secondary #91

Open
theoden8 opened this issue Dec 12, 2024 · 3 comments
Labels: enhancement (New feature or request)

@theoden8
Describe the bug
When I run semantic/SDPM chunker, min_chunk_size is not respected.

To Reproduce

#!/usr/bin/env python3


import chonkie


text = """
Ja5uKaexug
"""
new_text = ""
for t in text.strip():
    new_text += " " + t
new_text += "\n"
text = new_text * 100


if __name__ == '__main__':
    chunker = chonkie.SDPMChunker(
        embedding_model='BAAI/bge-large-en-v1.5',
        min_chunk_size=256,
        chunk_size=512*2,
        skip_window=0,
        threshold=0.9,
    )
    chunks = [t.text for t in chunker.chunk(text)]
    print([chunker.embedding_model.count_tokens(t) for t in chunks])
    # [932, 62]
    print([len(t) for t in chunks])
    # [1953, 126]

Expected behavior

Every chunk should contain at least min_chunk_size = 256 tokens.

Additional context

Related to recently implemented #40

@theoden8 theoden8 added the bug Something isn't working label Dec 12, 2024
@bhavnicksm (Collaborator)

Hey @theoden8!

Thanks for submitting an issue! 😊

Let me explain why this is currently expected behaviour with Chonkie's SDPMChunker.

Chonkie's SDPMChunker (and the other chunkers) enforce two strict invariants: a maximum chunk size (chunk_size), and reconstructability, i.e. all chunks concatenated must reproduce the original text. min_chunk_size is therefore lower in priority than chunk_size, and in the semantic chunkers especially, the threshold can put the two parameters in conflict.

So the first chunk comes out at 932 tokens (< 1024 = chunk_size) and the last chunk is whatever remains. Since SDPMChunker only combines text that is semantically similar, and your text and threshold are such that the pieces almost always combine, you end up seeing this behaviour.

In most cases the maximum matters more, since embedding models have a hard context limit. And you can easily augment the chunks with the OverlapRefinery, which adds an overlap~

So, this is expected behaviour at the moment—but if this doesn't work for you, could you tell me how this could be made better? 🙏
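To make the priority between the two parameters concrete, here is a deliberately simplified model (not Chonkie's actual implementation) of greedy merging under a hard chunk_size cap: whatever doesn't fit into the last full chunk becomes a tail chunk, regardless of any minimum size.

```python
def greedy_merge(sentence_token_counts, chunk_size):
    """Greedily merge consecutive 'sentences' until adding the next one
    would exceed chunk_size, then start a new chunk."""
    chunks, current = [], 0
    for n in sentence_token_counts:
        if current and current + n > chunk_size:
            chunks.append(current)
            current = 0
        current += n
    if current:
        chunks.append(current)  # the remainder, however small
    return chunks

# 110 highly similar "sentences" of 10 tokens each, chunk_size=1024:
sizes = greedy_merge([10] * 110, chunk_size=1024)
print(sizes)  # [1020, 80] -- the tail chunk is far below 256 tokens
```

This mirrors the [932, 62] result in the report: enforcing the maximum and reconstructability leaves no room to also guarantee the minimum on the final chunk.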

Thanks ☺️

@theoden8 (Author) commented Dec 12, 2024

Hey, thank you for a detailed explanation!

The hard limit on the maximum chunk length makes sense, since anything beyond it would be truncated. However, there are some reasons to want bigger chunks:

  1. Parent chunks: there is no point having small chunks among parent chunks, since they are supposed to provide wider context.
  2. Structured data: I split on Markdown delimiters first, apply semantic segmentation second, and merge small leftover pieces third. Keeping headers with their chunks matters more than embedding-distance outliers, and making sure bullet points don't break in the middle also takes precedence (the third step).
  3. Small chunks clutter the vector database unnecessarily; attaching those 3 tokens to some neighbouring chunk wouldn't hurt.

So, basically, in hybrid splitting, structure and determinism take precedence for me, and semantic splitting is a nice extra that keeps the splits from being completely arbitrary.
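The structure-first pipeline described above (Markdown delimiters first, small-leftover merging last) can be sketched roughly like this; the function names and the whitespace token counting are illustrative only, not Chonkie's API:

```python
import re

def split_by_headers(text):
    """Step 1: split on Markdown headers so each section keeps its heading."""
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p for p in parts if p.strip()]

def merge_small(chunks, min_tokens, count_tokens=lambda s: len(s.split())):
    """Step 3: fold any chunk below min_tokens into its predecessor,
    so tiny leftovers never reach the vector database on their own."""
    merged = []
    for c in chunks:
        if merged and count_tokens(c) < min_tokens:
            merged[-1] = merged[-1].rstrip("\n") + "\n" + c
        else:
            merged.append(c)
    return merged

doc = ("# Intro\nSome long introduction text here.\n"
       "# Tiny\nok\n"
       "# Outro\nMore closing text follows here.")
sections = split_by_headers(doc)               # 3 sections, one per header
chunks = merge_small(sections, min_tokens=4)
print(len(chunks))  # 2: the tiny "# Tiny" section folded into "# Intro"
```

Step 2 (semantic segmentation within each oversized section) would slot in between the two functions; it is omitted here since it needs an embedding model.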

@bhavnicksm (Collaborator)

Thanks for the valuable feedback, @theoden8! 😄🫂

Totally valid, makes sense! In fact, I am working towards a hierarchical/sequential chunking solution that handles Markdown first, applies semantic chunking second, and then makes sure the chunks are not too small either. It's definitely on the roadmap.

I'd have to ask you to be patient with Chonkie, since it's planned for release with version v0.3.0 or beyond :)

Thanks 😊

(P.S. Since I can't "resolve" the issue at the moment, I will keep this open but change to a feature request instead, if you don't mind... And I will ping you here when I am done with the feature.)

@bhavnicksm bhavnicksm added enhancement New feature or request and removed bug Something isn't working labels Dec 12, 2024
@bhavnicksm bhavnicksm changed the title [BUG] min_chunk_size is not respected [FEAT] Support Hierarchical Chunking with Semantic Chunking as a secondary Dec 12, 2024