
feat: add RecursiveSplitter component for Document preprocessing #8605

Merged on Jan 10, 2025 (100 commits)
Changes shown from 30 of 100 commits.

Commits:
a49fc93
initial import
davidsbatista Nov 20, 2024
41f5f64
initial import
davidsbatista Nov 20, 2024
79c669e
wip
davidsbatista Nov 20, 2024
87b8023
adding initial version + tests
davidsbatista Nov 25, 2024
09b25f3
adding more tests
davidsbatista Dec 2, 2024
a39f481
more tests
davidsbatista Dec 2, 2024
4e9b4ea
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 4, 2024
db82194
incorporating SentenceSplitter based on NLTK
davidsbatista Dec 4, 2024
cbfcc66
adding more tests
davidsbatista Dec 4, 2024
74de92c
adding release notes
davidsbatista Dec 4, 2024
4054c47
adding LICENSE header
davidsbatista Dec 4, 2024
6b72a17
removing unused imports
davidsbatista Dec 4, 2024
4c0afb1
fixing example docstring
davidsbatista Dec 4, 2024
24739be
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 4, 2024
8e62968
addding docstrings
davidsbatista Dec 4, 2024
12549bd
fixing tests and returning a dictionary
davidsbatista Dec 4, 2024
20a7f52
updating release notes
davidsbatista Dec 4, 2024
323319b
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 5, 2024
5945e6d
attending PR comments
davidsbatista Dec 6, 2024
01ad974
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 9, 2024
eaf9b77
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 9, 2024
b5391f6
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 10, 2024
adf1b1a
wip: updating tests for split_idx_start and _split_overlap
davidsbatista Dec 10, 2024
d4a2a0b
adding tests for split_idx and split_start and overlaps
davidsbatista Dec 11, 2024
aed28c5
adjusting file for LICENSE checking
davidsbatista Dec 11, 2024
eb5afb5
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 11, 2024
824142f
adding more tests
davidsbatista Dec 11, 2024
e4815d8
adding tests for page numbering
davidsbatista Dec 11, 2024
8f1ae36
adding tests for min split lenghts and falling back to character-leve…
davidsbatista Dec 11, 2024
5a49eab
fixing linting issue
davidsbatista Dec 11, 2024
a5c1f2c
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
2248135
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
4263352
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
6ee5551
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
0325a8b
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
b2b94b5
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
85f2ea2
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
644056f
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
3cb85d9
wip
davidsbatista Dec 12, 2024
459bfa7
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 12, 2024
42faf05
wip
davidsbatista Dec 12, 2024
7d9c4df
updating tests
davidsbatista Dec 12, 2024
d66afd5
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 12, 2024
5bcf709
wip: fixing all tests after changes
davidsbatista Dec 12, 2024
9205ef2
more tests
davidsbatista Dec 12, 2024
437570f
wip: debugging sentence overlap
davidsbatista Dec 12, 2024
97437d8
wip: debugging page number
davidsbatista Dec 13, 2024
13f85e1
wip
davidsbatista Dec 16, 2024
eebe1a0
wip; fixed bug with sentence tokenizer, needs to keep white spaces
davidsbatista Dec 16, 2024
3f00b3b
adding tests for counting pages on different split approaches
davidsbatista Dec 16, 2024
d9addfa
NLTK checks done on SentenceSplitter
davidsbatista Dec 16, 2024
080a529
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 16, 2024
c3f09d0
fixing types
davidsbatista Dec 16, 2024
2df40c3
adding detecting for full overlap with previous chunks
davidsbatista Dec 16, 2024
0492025
fixing types
davidsbatista Dec 16, 2024
09362e4
improving docstring
davidsbatista Dec 16, 2024
eb38a2b
improving docstring
davidsbatista Dec 16, 2024
a418f73
adding custom lenght, 'character' use case
davidsbatista Dec 17, 2024
71ce15b
customising overlap function for word and adding a few tests
davidsbatista Dec 17, 2024
3a9d290
updating docstring
davidsbatista Dec 17, 2024
f35d4e5
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 17, 2024
938b610
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 19, 2024
bc4dfbd
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 19, 2024
371028c
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 19, 2024
79cd8bd
wip: adding more tests for word unit length
davidsbatista Dec 17, 2024
31c8412
fix
davidsbatista Dec 17, 2024
e1fed92
feat: `Tool` dataclass - unified abstraction to represent tools (#8652)
anakin87 Dec 18, 2024
f71a22b
fix: fix deserialization issues in multi-threading environments (#8651)
wochinge Dec 18, 2024
211c4ed
adding 'word' as default length
davidsbatista Dec 19, 2024
0807902
fixing types
davidsbatista Dec 19, 2024
460cc7d
handing both default strategies
davidsbatista Dec 19, 2024
7901af5
wip
davidsbatista Dec 19, 2024
2af6b03
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 19, 2024
8a09157
\f was not being counted properly
davidsbatista Dec 19, 2024
3ad73a5
updating tests
davidsbatista Dec 20, 2024
d292de6
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 20, 2024
b09154e
fixing the overlap bug
davidsbatista Dec 20, 2024
bd67369
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 20, 2024
c1fa6c2
adding more tests
davidsbatista Dec 21, 2024
de5e951
refactoring _apply_overlap
davidsbatista Dec 21, 2024
81c7c89
further refactoring
davidsbatista Dec 21, 2024
e398120
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 21, 2024
6977b2a
Merge branch 'main' into add-recursive-chunking
davidsbatista Jan 3, 2025
50ac7af
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
602ac9b
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
78ebc71
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
a6a2475
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
80b8f2c
adding ticks to close code block
davidsbatista Jan 8, 2025
2040c7c
fixing comments
davidsbatista Jan 8, 2025
977de8e
applying changes: split with space and force keep_white_spaces=True
davidsbatista Jan 8, 2025
c4ada43
Merge branch 'main' into add-recursive-chunking
davidsbatista Jan 8, 2025
d87ffe6
fixing some tests and replacing count words approach in more places
davidsbatista Jan 8, 2025
df214d6
keep_white_spaces = True only if not defined
davidsbatista Jan 9, 2025
25721bb
Merge branch 'main' into add-recursive-chunking
davidsbatista Jan 9, 2025
951956b
cleaning docs
davidsbatista Jan 9, 2025
e1464eb
handling some more edge cases, when split is still too big and all se…
davidsbatista Jan 9, 2025
3eb532c
fixing fallback whitespaces count to fixed word/char split based on s…
davidsbatista Jan 10, 2025
38fce46
cleaning
davidsbatista Jan 10, 2025
c5d8b2f
cleaning
davidsbatista Jan 10, 2025
89b7ad1
Merge branch 'main' into add-recursive-chunking
davidsbatista Jan 10, 2025
3 changes: 2 additions & 1 deletion haystack/components/preprocessors/__init__.py
@@ -5,6 +5,7 @@
from .document_cleaner import DocumentCleaner
from .document_splitter import DocumentSplitter
from .nltk_document_splitter import NLTKDocumentSplitter
+from .recursive_splitter import RecursiveDocumentSplitter
from .text_cleaner import TextCleaner

-__all__ = ["DocumentSplitter", "DocumentCleaner", "TextCleaner", "NLTKDocumentSplitter"]
+__all__ = ["DocumentSplitter", "DocumentCleaner", "RecursiveDocumentSplitter", "TextCleaner", "NLTKDocumentSplitter"]
2 changes: 1 addition & 1 deletion haystack/components/preprocessors/document_splitter.py
@@ -14,7 +14,7 @@
logger = logging.getLogger(__name__)

# Maps the 'split_by' argument to the actual char used to split the Documents.
-# 'function' is not in the mapping cause it doesn't split on chars.
+# 'function' is not in the mapping because it doesn't split on chars.
_SPLIT_BY_MAPPING = {"page": "\f", "passage": "\n\n", "sentence": ".", "word": " ", "line": "\n"}


241 changes: 241 additions & 0 deletions haystack/components/preprocessors/recursive_splitter.py
@@ -0,0 +1,241 @@
# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]>
#
# SPDX-License-Identifier: Apache-2.0

import re
from copy import deepcopy
from typing import Dict, List, Optional

from haystack import Document, component, logging

logger = logging.getLogger(__name__)


@component
class RecursiveDocumentSplitter:
    """
    Recursively chunk text into smaller chunks.

    This component splits text into smaller chunks by recursively applying a list of separators to the text.

    The separators are applied in the order they are provided, with the last separator being the most specific one.

    Each separator is applied to the text; the resulting chunks that are within the split_length are kept, while the
    chunks that are larger than the split_length are split again using the next separator in the list.

    This is done until all chunks are no larger than the split_length parameter.

    Example:

    ```python
    from haystack import Document
    from haystack.components.preprocessors import RecursiveDocumentSplitter

    chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "], keep_separator=True)
    text = '''Artificial intelligence (AI) - Introduction

    AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
    AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.'''

    doc = Document(content=text)
    doc_chunks = chunker.run([doc])["documents"]
    >[
    >Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '65167a9823dd883de577e828ca4fd529e6f7241f0ff616acfce454d808478951'}),
    >Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems. ', meta: {'original_id': '65167a9823dd883de577e828ca4fd529e6f7241f0ff616acfce454d808478951'}),
    >Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '65167a9823dd883de577e828ca4fd529e6f7241f0ff616acfce454d808478951'}),
    >Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '65167a9823dd883de577e828ca4fd529e6f7241f0ff616acfce454d808478951'})
    >]
    ```
    """  # noqa: E501

    def __init__(  # pylint: disable=too-many-positional-arguments
        self,
        split_length: int = 200,
        split_overlap: int = 0,
        separators: Optional[List[str]] = None,
        keep_separator: bool = True,
    ):
"""
Initializes a RecursiveDocumentSplitter.

:param split_length: The maximum length of each chunk.
sjrl (Contributor) commented:

Two comments:

  • First, we should make it clear what unit we are measuring length by. In this case it appears to be the number of characters.
  • Second, I don't think we should measure chunks by the number of characters. I think measuring by the number of words makes more intuitive sense, especially for the types of separators we recommend by default. Optionally, we could add a new variable to choose between character and word counting.

What do you think?

davidsbatista (Contributor, Author) replied:

I wrote this component thinking about counting by characters. I agree that counting by words is more intuitive in some cases, but a few points:

  • What exactly is a word? Do words connected with "-" count as one or more?
  • Since we are splitting by regex, we would also need to account for the case where a word is split by the regex: does it count as one word or not?
  • Or do we force the regexes to always split by word?
  • There are probably other issues we would need to think about.

I think it's a valid point to use words instead of characters, but I'm afraid this would make this PR snowball into something even bigger. Maybe we can aim to close this for now, bring it to an end, and then think about the extra changes and requests?

sjrl (Contributor) replied on Dec 17, 2024:

Ahh sorry, I missed your response on this until now.

I see your concerns, but I think overall we can just use a naive way of counting words and split by " ". By no means do I think this counting needs to be perfect, and we can even document that we do it this way. I just think it's more intuitive, as you say, to think about the length of a chunk in terms of words rather than characters.

So to sum up, I think it would be fine to count the number of words naively and not worry about the cases you outlined. Just a basic

len_chunk = len(chunk.split(" "))

would do.

I'd also be in favor of allowing a user to choose what type of thing they'd like to count (e.g. characters, words, tokens, etc.), but if we think this is a good idea then we should set up a framework for that now (e.g. make it easy to add these parameters to the init method) so we don't have to introduce breaking changes in the future.

davidsbatista (Contributor, Author) replied:

OK - we can go with simple space-based words for now - I will work on this.

davidsbatista (Contributor, Author) replied:

I've added a simple approach - I still need to add a few more tests, but it seems to be working.
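For reference, a runnable sketch of the naive space-based counting agreed on above (the helper name is illustrative, not part of this PR):

```python
def naive_word_count(chunk: str) -> int:
    # Count words by splitting on single spaces, as suggested above.
    # Hyphenated or punctuation-attached tokens each count as one word.
    return len(chunk.split(" "))

print(naive_word_count("Artificial intelligence (AI) - Introduction"))  # 5
```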

        :param split_overlap: The number of characters to overlap between consecutive chunks.
        :param separators: An optional list of separator strings to use for splitting the text. The string
            separators will be treated as regular expressions unless the separator is "sentence", in which case the
            text will be split into sentences using a custom sentence tokenizer based on NLTK.
            If no separators are provided, the default separators ["\n\n", "\n", ".", " "] are used.
        :param keep_separator: Whether to keep the separator character in the resulting chunks.
        :raises ValueError: If the overlap is greater than or equal to the chunk size, if the overlap is negative, or
            if any separator is not a string.
        """
        self.split_length = split_length
        self.split_overlap = split_overlap
        self.separators = separators if separators else ["\n\n", "\n", ".", " "]
        self.keep_separator = keep_separator
        self._check_params()
        if "sentence" in self.separators:
            self.nltk_tokenizer = self._get_custom_sentence_tokenizer()

    def _check_params(self):
        if self.split_length < 1:
            raise ValueError("Split length must be at least 1 character.")
        if self.split_overlap < 0:
            raise ValueError("Overlap must be greater than or equal to zero.")
        if self.split_overlap >= self.split_length:
            raise ValueError("Overlap cannot be greater than or equal to the chunk size.")
        if not all(isinstance(separator, str) for separator in self.separators):
            raise ValueError("All separators must be strings.")

    @staticmethod
    def _get_custom_sentence_tokenizer():
        try:
            from haystack.components.preprocessors.sentence_tokenizer import SentenceSplitter
        except (LookupError, ModuleNotFoundError):
            raise Exception("You need to install NLTK to use this function. You can install it via `pip install nltk`")
        return SentenceSplitter()

    def _apply_overlap(self, chunks: List[str]) -> List[str]:
        """
        Applies an overlap between consecutive chunks if the split_overlap attribute is greater than zero.

        :param chunks: List of text chunks.
        :returns:
            The list of chunks with overlap applied.
        """
        overlapped_chunks = []
        for idx, chunk in enumerate(chunks):
            if idx == 0:
                overlapped_chunks.append(chunk)
                continue
            overlap_start = max(0, len(chunks[idx - 1]) - self.split_overlap)
            current_chunk = chunks[idx - 1][overlap_start:] + chunk
            overlapped_chunks.append(current_chunk)
        return overlapped_chunks

    def _chunk_text(self, text: str) -> List[str]:
        """
        Recursive chunking algorithm that divides text into smaller chunks based on a list of separator characters.

        It starts with a list of separator characters (e.g., ["\n\n", "\n", " ", ""]) and attempts to divide the text
        using the first separator. If the resulting chunks are still larger than the specified chunk size, it moves to
        the next separator in the list. This process continues recursively, applying each progressively more specific
        separator until the chunks meet the desired size criteria.

        :param text: The text to be split into chunks.
        :returns:
            A list of text chunks.
        """
        if len(text) <= self.split_length:
            return [text]

        for curr_separator in self.separators:  # type: ignore # the caller already checked that separators is not None
            if curr_separator == "sentence":
                # use the custom NLTK-based sentence tokenizer
                sentence_with_spans = self.nltk_tokenizer.split_sentences(text)
                splits = [sentence["sentence"] for sentence in sentence_with_spans]
            else:
                # apply the current separator regex to split the text
                escaped_separator = re.escape(curr_separator)
                splits = re.split(escaped_separator, text)

            if len(splits) == 1:  # go to the next separator if the current one was not found
                continue

            chunks = []
            current_chunk: List[str] = []
            current_length = 0

            # check the splits: if any is too long, recursively chunk it, otherwise add it to the current chunk
            for idx, split in enumerate(splits):
                split_text = split

                # add the separator to the split, if it's not the last split
                if self.keep_separator and curr_separator != "sentence" and idx < len(splits) - 1:
                    split_text = split + curr_separator

                # if adding this split exceeds the chunk size, process the current chunk
                if current_length + len(split_text) > self.split_length:
                    if current_chunk:  # keep the good splits
                        chunks.append("".join(current_chunk))
                        current_chunk = []
                        current_length = 0

                    # recursively handle splits that are too large
                    if len(split_text) > self.split_length:
                        if curr_separator == self.separators[-1]:
                            # tried the last separator, can't split further; break the loop and fall back to
                            # character-level chunking
                            break
                        chunks.extend(self._chunk_text(split_text))
                    else:
                        chunks.append(split_text)
                else:
                    current_chunk.append(split_text)
                    current_length += len(split_text)

            if current_chunk:
                chunks.append("".join(current_chunk))

            if self.split_overlap > 0:
                chunks = self._apply_overlap(chunks)

            if chunks:
                return chunks

        # if no separator worked, fall back to character-level chunking
        return [text[i : i + self.split_length] for i in range(0, len(text), self.split_length - self.split_overlap)]

    def _run_one(self, doc: Document) -> List[Document]:
        new_docs: List[Document] = []
        chunks = self._chunk_text(doc.content)  # type: ignore # the caller already checked for a non-empty doc.content
        chunks = chunks[:-1] if len(chunks[-1]) == 0 else chunks  # remove the last chunk if it's empty
        current_position = 0
        current_page = 1

        for split_nr, chunk in enumerate(chunks):
            new_doc = Document(content=chunk, meta=deepcopy(doc.meta))
            new_doc.meta["original_id"] = doc.id
            new_doc.meta["split_id"] = split_nr
            new_doc.meta["split_idx_start"] = current_position
            new_doc.meta["_split_overlap"] = [] if self.split_overlap > 0 else None
            new_doc.meta["page_number"] = current_page

            if split_nr > 0 and self.split_overlap > 0:
                previous_doc = new_docs[-1]
                overlap_length = len(previous_doc.content) - (current_position - previous_doc.meta["split_idx_start"])  # type: ignore
                if overlap_length > 0:
                    previous_doc.meta["_split_overlap"].append({"doc_id": new_doc.id, "range": (0, overlap_length)})
                    new_doc.meta["_split_overlap"].append(
                        {
                            "doc_id": previous_doc.id,
                            "range": (len(previous_doc.content) - overlap_length, len(previous_doc.content)),  # type: ignore
                        }
                    )

            new_docs.append(new_doc)
            current_page += chunk.count("\f")  # update the page number based on the number of page breaks
            current_position += len(chunk) - (self.split_overlap if split_nr < len(chunks) - 1 else 0)

        return new_docs

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, List[Document]]:
        """
        Split documents into Documents with smaller chunks of text.

        :param documents: List of Documents to split.
        :returns:
            A dictionary with a key "documents", holding a List of Documents with smaller chunks of text
            corresponding to the input documents.
        """
        docs = []
        for doc in documents:
            if not doc.content or doc.content == "":
                logger.warning("Document ID {doc_id} has empty content. Skipping this document.", doc_id=doc.id)
                continue
            docs.extend(self._run_one(doc))

        return {"documents": docs}
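To illustrate the overlap logic above, a minimal sketch that calls the private `_apply_overlap` method directly (for demonstration only; the output is traced by hand against this revision):

```python
from haystack.components.preprocessors import RecursiveDocumentSplitter

splitter = RecursiveDocumentSplitter(split_length=8, split_overlap=3, separators=[" "])
# Each chunk after the first is prefixed with the last `split_overlap`
# characters of the preceding chunk.
print(splitter._apply_overlap(["abcdefgh", "ijklmnop"]))
# ['abcdefgh', 'fghijklmnop']
```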
@@ -0,0 +1,4 @@
---
features:
  - |
    Added a `RecursiveDocumentSplitter` which uses a set of separators to split text recursively. It attempts to divide the text using the first separator; if the resulting chunks are still larger than the specified split length, it moves to the next separator in the list.
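A minimal usage sketch of the new component (output traced by hand against the diff above; split lengths are measured in characters at this revision):

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

# With the default separators, paragraphs ("\n\n") are tried first; any chunk
# still longer than split_length would be re-split on the next separator.
splitter = RecursiveDocumentSplitter(split_length=20, split_overlap=0)
result = splitter.run(documents=[Document(content="First paragraph.\n\nSecond paragraph.")])
for chunk in result["documents"]:
    print(repr(chunk.content))
# 'First paragraph.\n\n'
# 'Second paragraph.'
```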