[BUG] TokenChunker Batch_chunking gives wrong end_index #84
Comments
Hey @CharlesMoslonka! Thanks for opening an issue and the kind words 😊 I understand the issue you are seeing, and I'll look into reproducing it so I can add a patch to Chonkie at the earliest. Thanks to the detailed reproduction steps, I should be able to do that quickly. Will post updates on the progress of this bug here. Thanks!
[fix] Correct the start and end indices for TokenChunker in Batch mode (#84)
Thanks for your patience~ The patch is in the source and will be released with the next patch release soon! Thanks! 😊
Describe the bug
When using the `chunk_batch()` method, the resulting `Chunks` have a wrong `end_index`. The indices seem to be counted in token units instead of string character units. This does not happen when using the single `.chunk()` method.

To Reproduce
Suppose that `text_ds` is a `list` of `str` that contains the texts you want to chunk.
This prints `300`, or whatever `chunk_size` is.

Expected behavior
`chunks[0][0].end_index` should return a greater `int` value, i.e. a character offset into the original string rather than a token count.
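One way to state the expectation concretely is that the indices should slice the original string back out; a hypothetical check, reusing the names from the sketch above:

```python
# If start_index/end_index are character offsets, slicing the source text with
# them should reproduce each chunk's text exactly.
for text, text_chunks in zip(text_ds, chunks):
    for chunk in text_chunks:
        assert text[chunk.start_index:chunk.end_index] == chunk.text
```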
Additional context
I could not check the other chunkers, and I have the same issue as #73. I tried to look in the code; maybe it originates from the `_process_batch` method of the `TokenChunker` class? I'll try to dig deeper if I have time. A sketch of the kind of mix-up I suspect is below.
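This is not Chonkie's actual `_process_batch`, just a hypothetical illustration of how accumulating token counts would produce token-unit indices, alongside one way to compute character-unit indices instead:

```python
def buggy_indices(token_counts):
    """Suspected pattern: offsets accumulated in tokens, so end == number of tokens."""
    indices, pos = [], 0
    for n in token_counts:
        indices.append((pos, pos + n))  # token offsets, not character offsets
        pos += n
    return indices


def character_indices(chunk_texts, original_text):
    """Character-unit offsets: locate each chunk's text in the original string."""
    indices, pos = [], 0
    for chunk_text in chunk_texts:
        start = original_text.index(chunk_text, pos)
        end = start + len(chunk_text)
        indices.append((start, end))
        pos = start  # search from this chunk's start so overlapping chunks still resolve
    return indices
```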
Anyway, thanks for your time and for the great package!
Cheers!