bug: text splitters not splitting file inputs correctly #3839
Comments
Is this problem solved yet? Currently the output of the experimental text splitters (such as the recursive text splitter) is unusable. Are there any workarounds?
Hi @rayhannarindran, we will be testing this shortly and will let you know about the progress.
@edwinjosechittilappilly Do you have any updates for this issue?
I’ve been testing multiple cases with the chunk sizes, and I’ve found that everything works well in most scenarios. However, there are issues when the chunk sizes are too low. To address this, I plan to establish a minimum value for the chunks. I’ll determine the exact minimum chunk size by Monday through further testing and by reviewing the LangChain text splitter documentation. These tests have also led me to discover a file upload limit error.
I wanted to clarify that in LangChain there is no fixed minimum chunk size for text splitters. The chunk size is a configurable parameter that you can adjust based on your specific needs and the nature of your text data. You can refer to this Stack Overflow discussion for more insights, and to the default values defined in the LangChain code.
It’s important to ensure that the chunk size is smaller than the maximum token limit of the language model you’re using; a few other guidelines apply as well. A configuration sketch follows below.
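For reference, here is a minimal sketch of how chunk size and overlap are typically configured on a LangChain splitter. The values shown are illustrative, not the library defaults, and the import assumes the standalone `langchain-text-splitters` package.

```python
# Illustrative values only; tune chunk_size/chunk_overlap to your model's
# token limit and your data. Assumes the langchain-text-splitters package.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200,  # characters shared between consecutive chunks
)

text = "Some long document text loaded from a file..."
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunk(s) produced")
```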
Given that we encountered issues where the number of splits doesn’t change with smaller values, such as chunk sizes of 10 or 20, we can add some test cases to investigate why the text splitter fails at these sizes. Additionally, such small chunks are not viable for an application anyway. Currently, the default value of 1000 in Langflow seems to be the most effective for the time being. It’s also worth noting that there is already error handling in place if the chunk size is less than the overlap size. A standalone comparison sketch is included below. Let me know your thoughts!
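To help isolate where the problem lies, here is a rough sketch that exercises the LangChain splitter directly, outside Langflow, assuming the `langchain-text-splitters` package is installed. If the chunk counts below change with `chunk_size` while the Langflow component’s output does not, the issue is likely in the component wrapper rather than in the splitter itself.

```python
# Compare split counts at several chunk sizes using LangChain directly.
# chunk_overlap is set to 0 to avoid the chunk_size < overlap error case.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "word " * 500  # roughly 2,500 characters of sample text

for chunk_size in (10, 20, 100, 1000):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=0,
    )
    chunks = splitter.split_text(text)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks")
```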
TODO: Add unit test cases in the future; a sketch of one possible test is below.
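As a starting point, here is a rough pytest sketch of the kind of unit test mentioned above. It targets the underlying LangChain splitter; a test for the Langflow component itself would need to go through the component’s own interface instead, which is not shown here.

```python
# Hypothetical test sketch: verifies that splitting actually occurs and that
# chunks respect the configured size for text much longer than chunk_size.
import pytest
from langchain_text_splitters import RecursiveCharacterTextSplitter


@pytest.mark.parametrize("chunk_size", [50, 100, 500, 1000])
def test_splitter_respects_chunk_size(chunk_size):
    text = "word " * 1000
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks = splitter.split_text(text)
    assert len(chunks) > 1
    assert all(len(chunk) <= chunk_size for chunk in chunks)
```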
Bug Description
Text splitters not splitting file inputs.
Reproduction
file -> text splitter
Expected behavior
The file contents should be split into chunks.
Who can help?
@jordanrfrazier
Operating System
m
Langflow Version
1.0.18
Python Version
None
Screenshot
No response
Flow File
No response