raise error if token count exceeds 1024 instead of attempting to re-chunk #29


Open · wants to merge 1 commit into main from fix-rechunking

Conversation

@khaledsulayman (Member) commented Jun 10, 2025

In the sdg_hub code this was ported from, chunk_document() was called whenever the token count of a particular chunk exceeded 1024.

The chunk_document function uses the recursive text splitter to split chunks down to a given token-size threshold, but it should no longer be needed now that we use the hybrid chunker. This PR instead raises an explicit error if a chunk still exceeds 1024 tokens.
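
For context, a rough sketch of the old fallback this PR removes, reusing the names from the diff below; the exact chunk_document() signature in sdg_hub is an assumption here:

rechunked = []
for c in chunked_document_all_icl:
    if get_token_count(c["document"], tokenizer) > 1024:
        # hypothetical call shape; the real chunk_document() in sdg_hub
        # may take different arguments
        rechunked.extend(chunk_document(c["document"], chunk_word_count=1024))
    else:
        rechunked.append(c["document"])

Since the hybrid chunker produces the chunks in the first place, that path should be dead code, and failing loudly is preferable to silently masking a chunker bug.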

@khaledsulayman marked this pull request as draft June 12, 2025 17:35
@khaledsulayman force-pushed the fix-rechunking branch 2 times, most recently from 328e9ef to c78d1f7 on June 13, 2025 16:44
@khaledsulayman marked this pull request as ready for review June 13, 2025 16:50
)
for c in chunked_document_all_icl:
    if get_token_count(c["document"], tokenizer) > 1024:
        raise ValueError("Chunk exceeds token count of 1024")
Contributor commented:

Could we print the first X and last X tokens of each oversized chunk, and also collect all chunks that are too big into a list and print them all at once? That way users know where they need to trim down their chunks before moving forward.

It would look something like:

Chunk Size Errors:
Chunk "foo bar baz ... biz baz bar." exceeds max token count of 1024.
Chunk "foo2 bar2 baz2 ... biz2 baz2 bar2." exceeds max token count of 1024.
...
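
A minimal sketch of that aggregation, reusing the names from the diff above; the preview length and message format are illustrative, and the word-level preview stands in for the token-level one suggested here:

PREVIEW = 10  # illustrative "first X and last X" size

errors = []
for c in chunked_document_all_icl:
    if get_token_count(c["document"], tokenizer) > 1024:
        # word-level preview as a stand-in; a token-level preview would
        # slice the tokenizer's output instead
        words = c["document"].split()
        head, tail = " ".join(words[:PREVIEW]), " ".join(words[-PREVIEW:])
        errors.append(f'Chunk "{head} ... {tail}" exceeds max token count of 1024.')

if errors:
    raise ValueError("Chunk Size Errors:\n" + "\n".join(errors))

Collecting every oversized chunk before raising lets a user fix all of them in one pass instead of re-running the pipeline once per failure.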
