
[FEAT] Support Hierarchical Chunking with Semantic Chunking as a secondary #91

Open
theoden8 opened this issue Dec 12, 2024 · 3 comments
Labels: enhancement (New feature or request)

@theoden8
Describe the bug
When I run semantic/SDPM chunker, min_chunk_size is not respected.

To Reproduce

#!/usr/bin/env python3


import chonkie


text = """
Ja5uKaexug
"""
new_text = ""
for t in text.strip():
    new_text += " " + t
new_text += "\n"
text = new_text * 100


if __name__ == '__main__':
    chunker = chonkie.SDPMChunker(
        embedding_model='BAAI/bge-large-en-v1.5',
        min_chunk_size=256,
        chunk_size=512*2,
        skip_window=0,
        threshold=0.9,
    )
    chunks = [t.text for t in chunker.chunk(text)]
    print([chunker.embedding_model.count_tokens(t) for t in chunks])
    # [932, 62]
    print([len(t) for t in chunks])
    # [1953, 126]

Expected behavior

Every chunk should contain at least min_chunk_size = 256 tokens.

Additional context

Related to recently implemented #40

@theoden8 theoden8 added the bug Something isn't working label Dec 12, 2024
@bhavnicksm (Collaborator)

Hey @theoden8!

Thanks for submitting an issue! 😊

Let me explain why this is currently expected behaviour with Chonkie's SDPMChunker.

Chonkie's SDPMChunker (and the other chunkers) enforce two strict invariants: a maximum chunk size (chunk_size), and reconstructability, i.e. all chunks concatenated must reproduce the original text. min_chunk_size is therefore lower in priority than chunk_size, and in the semantic chunkers especially, the threshold can put the two parameters in conflict.

So the first chunk comes out at 932 tokens (< 1024 = chunk_size) and the last chunk is whatever remains. Since SDPMChunker only combines text that is semantically similar, and your text and threshold are such that the pieces almost always combine, you end up seeing this behaviour.

In most cases the maximum matters more, since embedding models have a hard context limit. And you can easily augment the chunks with the OverlapRefinery, which adds an overlap~

So, this is expected behaviour at the moment—but if this doesn't work for you, could you tell me how this could be made better? 🙏
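To make the priority between the two parameters concrete, here is a deliberately simplified model (not Chonkie's actual implementation) of greedy merging under a hard chunk_size cap: whatever doesn't fit into the last full chunk becomes a tail chunk, regardless of any minimum size.

```python
def greedy_merge(sentence_token_counts, chunk_size):
    """Greedily merge consecutive 'sentences' until adding the next one
    would exceed chunk_size, then start a new chunk."""
    chunks, current = [], 0
    for n in sentence_token_counts:
        if current and current + n > chunk_size:
            chunks.append(current)
            current = 0
        current += n
    if current:
        chunks.append(current)  # the remainder, however small
    return chunks

# 110 highly similar "sentences" of 10 tokens each, chunk_size=1024:
sizes = greedy_merge([10] * 110, chunk_size=1024)
print(sizes)  # [1020, 80] -- the tail chunk is far below 256 tokens
```

This mirrors the [932, 62] result in the report: enforcing the maximum and reconstructability leaves no room to also guarantee the minimum on the final chunk.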

Thanks ☺️

@theoden8 (Author) commented Dec 12, 2024

Hey, thank you for a detailed explanation!

The hard limit on the maximum chunk length makes sense, since anything beyond it would be truncated. However, there are some reasons to want bigger chunks:

  1. Parent chunks: there is no point having small chunks among parent chunks, since they are supposed to provide wider context.
  2. Structured data: I split on Markdown delimiters first, apply semantic segmentation second, and merge small leftover pieces third. Keeping headers with their chunks matters more than embedding-distance outliers, and making sure bullet points don't break in the middle also takes precedence (the third step).
  3. Small chunks clutter the vector database unnecessarily; attaching those 3 tokens to some neighbouring chunk wouldn't hurt.

So, basically, in hybrid splitting, structure and determinism take precedence for me, and semantic splitting is a nice extra that keeps the splits from being completely arbitrary.
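The structure-first pipeline described above (Markdown delimiters first, small-leftover merging last) can be sketched roughly like this; the function names and the whitespace token counting are illustrative only, not Chonkie's API:

```python
import re

def split_by_headers(text):
    """Step 1: split on Markdown headers so each section keeps its heading."""
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p for p in parts if p.strip()]

def merge_small(chunks, min_tokens, count_tokens=lambda s: len(s.split())):
    """Step 3: fold any chunk below min_tokens into its predecessor,
    so tiny leftovers never reach the vector database on their own."""
    merged = []
    for c in chunks:
        if merged and count_tokens(c) < min_tokens:
            merged[-1] = merged[-1].rstrip("\n") + "\n" + c
        else:
            merged.append(c)
    return merged

doc = ("# Intro\nSome long introduction text here.\n"
       "# Tiny\nok\n"
       "# Outro\nMore closing text follows here.")
sections = split_by_headers(doc)               # 3 sections, one per header
chunks = merge_small(sections, min_tokens=4)
print(len(chunks))  # 2: the tiny "# Tiny" section folded into "# Intro"
```

Step 2 (semantic segmentation within each oversized section) would slot in between the two functions; it is omitted here since it needs an embedding model.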

@bhavnicksm (Collaborator)

Thanks for the valuable feedback, @theoden8! 😄🫂

Totally valid, makes sense! In fact, I am working towards a hierarchical/sequential chunking solution that handles Markdown first, applies semantic chunking second, and then makes sure the chunks are not too small either. It's definitely on the roadmap.

I'd have to ask you to be patient with Chonkie, since it's planned for release with version v0.3.0 or beyond :)

Thanks 😊

(P.S. Since I can't "resolve" the issue at the moment, I will keep this open but change to a feature request instead, if you don't mind... And I will ping you here when I am done with the feature.)

@bhavnicksm bhavnicksm added enhancement New feature or request and removed bug Something isn't working labels Dec 12, 2024
@bhavnicksm bhavnicksm changed the title [BUG] min_chunk_size is not respected [FEAT] Support Hierarchical Chunking with Semantic Chunking as a secondary Dec 12, 2024