feat: Add DocumentCleaner 2.0 #5976
Hey, @julian-risch... Good work!
I found some opportunities for improvement
(and several occasions for me to better understand).
haystack/preview/components/preprocessors/text_document_cleaner.py
LGTM!
(Only a small comment about the docstring)
Related Issues
TextDocumentCleaner #5676
Proposed Changes:
Add new DocumentCleaner component with the options to
Also added new unit tests for this component.
The code for removing repeated substrings (footers, headers) was mostly copied over from 1.x.
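For readers unfamiliar with the idea, here is a minimal sketch of removing repeated headers and footers. It is not the code from this PR or from 1.x: the function name `strip_repeated_header_footer` is hypothetical, and it assumes the text has already been split into pages and that headers and footers are whole lines repeated identically on every page.

```python
from typing import List


def strip_repeated_header_footer(pages: List[str]) -> List[str]:
    """Drop leading and trailing lines that are identical on every page.

    Hypothetical helper for illustration only; assumes headers and footers
    are complete lines that repeat verbatim across all pages.
    """
    if len(pages) < 2:
        # With a single page there is nothing to compare against.
        return list(pages)

    split_pages = [page.split("\n") for page in pages]
    first = split_pages[0]
    min_len = min(len(lines) for lines in split_pages)

    # Count how many leading lines are shared by all pages.
    head = 0
    while head < min_len and all(lines[head] == first[head] for lines in split_pages):
        head += 1

    # Count shared trailing lines, without overlapping the header lines.
    tail = 0
    while tail < min_len - head and all(
        lines[-1 - tail] == first[-1 - tail] for lines in split_pages
    ):
        tail += 1

    return ["\n".join(lines[head : len(lines) - tail]) for lines in split_pages]


pages = [
    "ACME Report\nIntro text\nPage footer",
    "ACME Report\nMore text\nPage footer",
]
cleaned = strip_repeated_header_footer(pages)
```

Comparing whole lines rather than characters avoids cutting into body text that merely happens to end the same way on every page. Page numbers or other per-page variations in the footer would defeat this simple version.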
How did you test it?
Added new unit tests. We should add an end-to-end test later with an indexing pipeline containing a file converter, text document cleaner and text document splitter components. The end-to-end test will be added in the PR with the last component needed here: https://github.com/deepset-ai/haystack/pull/6037/files#diff-963c94f5742eb94f8771a87759aa17307f9cc868fa6ecbe80a431b4dcf14cf28
Notes for the reviewer
The issue mentions updating the "structure dictionary properly if it’s present", but I haven't addressed that in this PR so far. It's not 100% clear to me what this should look like, and it's probably not needed for 2.0. It could be added later.
Checklist
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.