bug: text splitters not splitting file inputs correctly #3839

jordanrfrazier · 2024-09-17T23:36:58Z

Bug Description

Text splitters not splitting file inputs.

Reproduction

file -> text splitter

Expected behavior

split

Who can help?

@jordanrfrazier

Operating System

m

Langflow Version

1.0.18

Python Version

None

Screenshot

No response

Flow File

No response

rayhannarindran · 2024-09-19T16:26:14Z

Is this problem solved yet? Currently the output of the experimental text splitters (such as the recursive text splitter) is unusable. Are there any workarounds?

edwinjosechittilappilly · 2024-09-19T18:23:56Z

Hi @rayhannarindran, we will test it shortly and let you know about the progress.

jordanrfrazier · 2024-09-22T18:05:35Z

@edwinjosechittilappilly Do you have any updates for this issue?

edwinjosechittilappilly · 2024-09-22T20:05:22Z

Hi @jordanrfrazier

I’ve been testing multiple cases with the chunk sizes, and I’ve found that everything is working well in most scenarios. However, there are issues when the chunk sizes are too low. To address this, I plan to establish a minimum value for the chunks.

I’ll determine the exact minimum chunk size by Monday through further testing and by reviewing the LangChain text splitter documentation.

Additionally, these tests have also led me to discover a file upload limit error.

edwinjosechittilappilly · 2024-09-25T15:33:55Z

I wanted to clarify that in LangChain, there is no fixed minimum chunk size for text splitters. The chunk size is a configurable parameter that you can adjust based on your specific needs and the nature of your text data. You can refer to this Stack Overflow discussion for more insights.

In the LangChain code, the default values are:

chunk_size: int = 4000
chunk_overlap: int = 200
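
To make the interaction between these two parameters concrete, here is a minimal sketch of a fixed-width chunker with overlap. The function name `sliding_chunks` is hypothetical and this is not LangChain's implementation, which splits on separators first. It only shows that each new chunk starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def sliding_chunks(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> list[str]:
    """Hypothetical fixed-width chunker, for illustration only."""
    if chunk_size <= 0 or chunk_overlap < 0 or chunk_overlap >= chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # distance between chunk starts
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# Ten characters, chunks of 4 with an overlap of 2.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

In this sketch sizes are counted in characters. Whether a given splitter counts characters or tokens depends on how it is configured.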

It’s important to ensure that the chunk size is smaller than the maximum token limit of the language model you’re using.

Some other guidelines:

For code snippets, chunk sizes of 300-500 tokens might be appropriate.
For engineering requirements documents, chunks of 1000-1500 tokens could be suitable.

Given that we encountered issues where the number of splits doesn’t change with smaller values, such as chunk sizes of 10 or 20, we can add some test cases to investigate why the text splitter fails at these sizes. Additionally, smaller chunks are not viable for an application either.
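
One possible explanation, not confirmed in this thread, is that a separator-based splitter only cuts at separators and never inside a piece. The sketch below is a simplified stand-in written for illustration (the name `split_on_separator` is hypothetical and this is not the LangChain or Langflow code). It merges pieces while they fit in `chunk_size` and emits an oversized piece whole, so once `chunk_size` drops below the shortest piece, the number of chunks stops changing.

```python
def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Hypothetical, simplified separator-based splitter (illustration only)."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size; it is kept whole, not cut
    if current:
        chunks.append(current)
    return chunks


text = "first paragraph of some length\n\nsecond paragraph of some length\n\nthird one"
for size in (10, 20, 1000):
    print(size, len(split_on_separator(text, size)))
```

With this toy input, chunk sizes of 10 and 20 both yield three chunks, one per paragraph, while 1000 yields a single chunk. A test case along these lines could help check whether the behaviour reported here has the same cause.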

Currently, the default value of 1000 in Langflow seems to be the most effective one for the time being. It’s also worth noting that there is already error handling in place if the chunk size is less than the overlap size.
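
For reference, a check of the kind described above could look like the following minimal sketch. The function name `validate_chunk_params` and the error messages are assumptions for illustration, not the actual Langflow or LangChain code.

```python
def validate_chunk_params(chunk_size: int, chunk_overlap: int) -> None:
    """Hypothetical validation helper; raises ValueError on bad settings."""
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be positive, got {chunk_size}")
    if chunk_overlap < 0:
        raise ValueError(f"chunk_overlap must not be negative, got {chunk_overlap}")
    if chunk_overlap > chunk_size:
        # The case mentioned above: the chunk size is less than the overlap size.
        raise ValueError(
            f"chunk_overlap ({chunk_overlap}) is larger than chunk_size ({chunk_size})"
        )


validate_chunk_params(1000, 200)  # passes silently

try:
    validate_chunk_params(10, 20)
except ValueError as exc:
    print("rejected:", exc)
```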

Let me know your thoughts!

edwinjosechittilappilly · 2024-10-21T14:41:55Z

TODO: Add unit test cases in future.

jordanrfrazier added the bug label Sep 17, 2024

jordanrfrazier self-assigned this Sep 17, 2024

jordanrfrazier assigned edwinjosechittilappilly Sep 19, 2024

edwinjosechittilappilly closed this as completed Oct 21, 2024
