[BUG] TokenChunker Batch_chunking gives wrong end_index #84
Comments
Hey @CharlesMoslonka! Thanks for opening an issue and the kind words 😊 I understand the issue you are seeing, and I'll look into reproducing it so I can add a patch to Chonkie at the earliest. Thanks to the detailed reproduction steps, I should be able to do that quickly. Will post updates on the progress of this bug here. Thanks!
[fix] Correct the start and end indices for TokenChunker in Batch mode (#84)
Thanks for your patience~ The patch is in the source and will be released with the next patch release soon! Thanks! 😊
Describe the bug
When using the `chunk_batch()` method, the resulting `Chunks` have a wrong `end_index`. The indices seem to be counted in token units instead of string character units. This does not happen when using the single `.chunk()` method.

To Reproduce
Suppose that `text_ds` is a `list` of `str` that contains the texts you want to chunk.
This prints `300`, or whatever `chunk_size` is.

Expected behavior
`chunks[0][0].end_index` should return a greater `int` value, i.e. a character offset into the original string rather than a token count.
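One way to state the expectation concretely is that the indices should slice the original string back out; a hypothetical check, reusing the names from the sketch above:

```python
# If start_index/end_index are character offsets, slicing the source text with
# them should reproduce each chunk's text exactly.
for text, text_chunks in zip(text_ds, chunks):
    for chunk in text_chunks:
        assert text[chunk.start_index:chunk.end_index] == chunk.text
```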
Additional context
I could not check the other chunkers, and I have the same issue as #73. I tried to look in the code; maybe it originates from the `_process_batch` method of the `TokenChunker` class? I'll try to dig deeper if I have time. A sketch of the kind of mix-up I suspect is below.
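This is not Chonkie's actual `_process_batch`, just a hypothetical illustration of how accumulating token counts would produce token-unit indices, alongside one way to compute character-unit indices instead:

```python
def buggy_indices(token_counts):
    """Suspected pattern: offsets accumulated in tokens, so end == number of tokens."""
    indices, pos = [], 0
    for n in token_counts:
        indices.append((pos, pos + n))  # token offsets, not character offsets
        pos += n
    return indices


def character_indices(chunk_texts, original_text):
    """Character-unit offsets: locate each chunk's text in the original string."""
    indices, pos = [], 0
    for chunk_text in chunk_texts:
        start = original_text.index(chunk_text, pos)
        end = start + len(chunk_text)
        indices.append((start, end))
        pos = start  # search from this chunk's start so overlapping chunks still resolve
    return indices
```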
Anyway, thanks for your time and for the great package!
Cheers!