bug: text splitters not splitting file inputs correctly #3839

Closed
jordanrfrazier opened this issue Sep 17, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@jordanrfrazier
Collaborator

Bug Description

Text splitters not splitting file inputs.

Reproduction

file -> text splitter

Expected behavior

split

Who can help?

@jordanrfrazier

Operating System

m

Langflow Version

1.0.18

Python Version

None

Screenshot

No response

Flow File

No response

@jordanrfrazier jordanrfrazier added the bug Something isn't working label Sep 17, 2024
@jordanrfrazier jordanrfrazier self-assigned this Sep 17, 2024
@rayhannarindran

Is this problem solved yet? Currently, the output of the experimental text splitters (such as the recursive text splitter) is unusable. Are there any workarounds?

@edwinjosechittilappilly
Collaborator

Hi @rayhannarindran, we will be testing it shortly and will let you know about the progress.

@jordanrfrazier
Collaborator Author

@edwinjosechittilappilly Do you have any updates for this issue?

@edwinjosechittilappilly
Collaborator

Hi @jordanrfrazier

I’ve been testing multiple cases with different chunk sizes, and everything works well in most scenarios. However, there are issues when the chunk size is too small. To address this, I plan to establish a minimum value for the chunk size.

I’ll determine the exact minimum chunk size by Monday through further testing and by reviewing the LangChain text splitter documentation.

Additionally, these tests have also led me to discover a file upload limit error.

@edwinjosechittilappilly
Collaborator

I wanted to clarify that in LangChain, there is no fixed minimum chunk size for text splitters. The chunk size is a configurable parameter that you can adjust based on your specific needs and the nature of your text data. You can refer to this Stack Overflow discussion for more insights.

In the LangChain code, the default values are:

  • chunk_size: int = 4000
  • chunk_overlap: int = 200

It’s important to ensure that the chunk size is smaller than the maximum token limit of the language model you’re using.
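
To make the two parameters concrete, here is a minimal sketch of how `chunk_size` and `chunk_overlap` interact. It is a plain fixed-window chunker written for illustration, not the LangChain implementation; the function name `sliding_window_chunks` is hypothetical, and lengths are counted in characters rather than tokens.

```python
def sliding_window_chunks(
    text: str, chunk_size: int = 4000, chunk_overlap: int = 200
) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbours share chunk_overlap characters.
    Illustrative only; this is not how LangChain splits text.
    """
    if chunk_overlap >= chunk_size:
        # Without this guard the window would never advance.
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk after the first begins with the last three characters of the previous one, which is the overlap.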

Some other guidelines:

  • For code snippets, chunk sizes of 300-500 tokens might be appropriate.
  • For engineering requirements documents, chunks of 1000-1500 tokens could be suitable.

Given that we encountered issues where the number of splits doesn’t change with smaller values, such as chunk sizes of 10 or 20, we can add some test cases to investigate why the text splitter fails at these sizes. In any case, chunks that small are not practical for a real application.
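
One possible explanation, which nothing in this thread confirms, is that a splitter that only cuts at a separator can never produce a chunk smaller than the text between two separators. The sketch below is a simplified split-then-merge routine written for illustration (the name `split_on_separator` is hypothetical and this is not Langflow or LangChain code); it shows the split count staying the same once `chunk_size` drops below the length of the shortest piece.

```python
def split_on_separator(
    text: str, separator: str = "\n\n", chunk_size: int = 1000
) -> list[str]:
    """Split on separator, then greedily merge neighbouring pieces.

    Pieces are merged while the merged length stays within chunk_size.
    A single piece longer than chunk_size is kept whole, because this
    sketch has no finer separator to fall back on.
    """
    pieces = [piece for piece in text.split(separator) if piece]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks


# Three paragraphs of 50, 60 and 70 characters.
document = "\n\n".join(["x" * 50, "y" * 60, "z" * 70])
counts = {
    size: len(split_on_separator(document, chunk_size=size))
    for size in (10, 20, 150, 1000)
}
print(counts)  # {10: 3, 20: 3, 150: 2, 1000: 1}
```

In this sketch, sizes 10 and 20 give the same three chunks because every paragraph is already longer than either limit. If the real splitter behaves similarly, the input's separators would matter more than the chunk size at those values.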

For now, the default value of 1000 in Langflow seems to be the most effective. It’s also worth noting that there is already error handling in place for the case where the chunk size is less than the overlap size.

Let me know your thoughts!

@edwinjosechittilappilly
Collaborator

TODO: Add unit test cases in future.
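
A rough sketch of what such test cases could look like, written as plain assert-based checks that pytest would collect. Everything here is an assumption: `fixed_width_split` is a hypothetical stand-in for the splitter under test, and the real tests would call the Langflow component instead.

```python
from typing import Callable

# A splitter takes the text and a chunk size and returns the chunks.
Splitter = Callable[[str, int], list[str]]

SAMPLE = "word " * 40  # 200 characters of input text


def fixed_width_split(text: str, chunk_size: int) -> list[str]:
    """Hypothetical stand-in: cut the text every chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def check_split_count_changes(split: Splitter) -> None:
    """Smaller chunk sizes should produce more chunks, including at 10 and 20."""
    counts = [len(split(SAMPLE, size)) for size in (10, 20, 100)]
    assert counts[0] > counts[1] > counts[2], counts


def check_chunks_respect_size(split: Splitter) -> None:
    """No chunk should be longer than the requested chunk size."""
    for size in (10, 20, 100):
        assert all(len(chunk) <= size for chunk in split(SAMPLE, size))


def test_small_chunk_sizes() -> None:
    check_split_count_changes(fixed_width_split)
    check_chunks_respect_size(fixed_width_split)
```

Passing the splitter in as a parameter keeps the checks reusable, so the same assertions could be pointed at each text splitter component in turn.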
