feat: add RecursiveSplitter component for Document preprocessing #8605

Merged
merged 100 commits into from
Jan 10, 2025
Changes from 18 commits
a49fc93
initial import
davidsbatista Nov 20, 2024
41f5f64
initial import
davidsbatista Nov 20, 2024
79c669e
wip
davidsbatista Nov 20, 2024
87b8023
adding initial version + tests
davidsbatista Nov 25, 2024
09b25f3
adding more tests
davidsbatista Dec 2, 2024
a39f481
more tests
davidsbatista Dec 2, 2024
4e9b4ea
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 4, 2024
db82194
incorporating SentenceSplitter based on NLTK
davidsbatista Dec 4, 2024
cbfcc66
adding more tests
davidsbatista Dec 4, 2024
74de92c
adding release notes
davidsbatista Dec 4, 2024
4054c47
adding LICENSE header
davidsbatista Dec 4, 2024
6b72a17
removing unused imports
davidsbatista Dec 4, 2024
4c0afb1
fixing example docstring
davidsbatista Dec 4, 2024
24739be
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 4, 2024
8e62968
addding docstrings
davidsbatista Dec 4, 2024
12549bd
fixing tests and returning a dictionary
davidsbatista Dec 4, 2024
20a7f52
updating release notes
davidsbatista Dec 4, 2024
323319b
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 5, 2024
5945e6d
attending PR comments
davidsbatista Dec 6, 2024
01ad974
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 9, 2024
eaf9b77
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 9, 2024
b5391f6
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 10, 2024
adf1b1a
wip: updating tests for split_idx_start and _split_overlap
davidsbatista Dec 10, 2024
d4a2a0b
adding tests for split_idx and split_start and overlaps
davidsbatista Dec 11, 2024
aed28c5
adjusting file for LICENSE checking
davidsbatista Dec 11, 2024
eb5afb5
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 11, 2024
824142f
adding more tests
davidsbatista Dec 11, 2024
e4815d8
adding tests for page numbering
davidsbatista Dec 11, 2024
8f1ae36
adding tests for min split lenghts and falling back to character-leve…
davidsbatista Dec 11, 2024
5a49eab
fixing linting issue
davidsbatista Dec 11, 2024
a5c1f2c
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
2248135
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
4263352
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
6ee5551
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
0325a8b
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
b2b94b5
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
85f2ea2
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
644056f
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 12, 2024
3cb85d9
wip
davidsbatista Dec 12, 2024
459bfa7
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 12, 2024
42faf05
wip
davidsbatista Dec 12, 2024
7d9c4df
updating tests
davidsbatista Dec 12, 2024
d66afd5
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 12, 2024
5bcf709
wip: fixing all tests after changes
davidsbatista Dec 12, 2024
9205ef2
more tests
davidsbatista Dec 12, 2024
437570f
wip: debugging sentence overlap
davidsbatista Dec 12, 2024
97437d8
wip: debugging page number
davidsbatista Dec 13, 2024
13f85e1
wip
davidsbatista Dec 16, 2024
eebe1a0
wip; fixed bug with sentence tokenizer, needs to keep white spaces
davidsbatista Dec 16, 2024
3f00b3b
adding tests for counting pages on different split approaches
davidsbatista Dec 16, 2024
d9addfa
NLTK checks done on SentenceSplitter
davidsbatista Dec 16, 2024
080a529
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 16, 2024
c3f09d0
fixing types
davidsbatista Dec 16, 2024
2df40c3
adding detecting for full overlap with previous chunks
davidsbatista Dec 16, 2024
0492025
fixing types
davidsbatista Dec 16, 2024
09362e4
improving docstring
davidsbatista Dec 16, 2024
eb38a2b
improving docstring
davidsbatista Dec 16, 2024
a418f73
adding custom lenght, 'character' use case
davidsbatista Dec 17, 2024
71ce15b
customising overlap function for word and adding a few tests
davidsbatista Dec 17, 2024
3a9d290
updating docstring
davidsbatista Dec 17, 2024
f35d4e5
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 17, 2024
938b610
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 19, 2024
bc4dfbd
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 19, 2024
371028c
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Dec 19, 2024
79cd8bd
wip: adding more tests for word unit length
davidsbatista Dec 17, 2024
31c8412
fix
davidsbatista Dec 17, 2024
e1fed92
feat: `Tool` dataclass - unified abstraction to represent tools (#8652)
anakin87 Dec 18, 2024
f71a22b
fix: fix deserialization issues in multi-threading environments (#8651)
wochinge Dec 18, 2024
211c4ed
adding 'word' as default length
davidsbatista Dec 19, 2024
0807902
fixing types
davidsbatista Dec 19, 2024
460cc7d
handing both default strategies
davidsbatista Dec 19, 2024
7901af5
wip
davidsbatista Dec 19, 2024
2af6b03
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 19, 2024
8a09157
\f was not being counted properly
davidsbatista Dec 19, 2024
3ad73a5
updating tests
davidsbatista Dec 20, 2024
d292de6
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 20, 2024
b09154e
fixing the overlap bug
davidsbatista Dec 20, 2024
bd67369
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 20, 2024
c1fa6c2
adding more tests
davidsbatista Dec 21, 2024
de5e951
refactoring _apply_overlap
davidsbatista Dec 21, 2024
81c7c89
further refactoring
davidsbatista Dec 21, 2024
e398120
Merge branch 'main' into add-recursive-chunking
davidsbatista Dec 21, 2024
6977b2a
Merge branch 'main' into add-recursive-chunking
davidsbatista Jan 3, 2025
50ac7af
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
602ac9b
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
78ebc71
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
a6a2475
Update haystack/components/preprocessors/recursive_splitter.py
davidsbatista Jan 8, 2025
80b8f2c
adding ticks to close code block
davidsbatista Jan 8, 2025
2040c7c
fixing comments
davidsbatista Jan 8, 2025
977de8e
applying changes: split with space and force keep_white_spaces=True
davidsbatista Jan 8, 2025
c4ada43
Merge branch 'main' into add-recursive-chunking
davidsbatista Jan 8, 2025
d87ffe6
fixing some tests and replacing count words approach in more places
davidsbatista Jan 8, 2025
df214d6
keep_white_spaces = True only if not defined
davidsbatista Jan 9, 2025
25721bb
Merge branch 'main' into add-recursive-chunking
davidsbatista Jan 9, 2025
951956b
cleaning docs
davidsbatista Jan 9, 2025
e1464eb
handling some more edge cases, when split is still too big and all se…
davidsbatista Jan 9, 2025
3eb532c
fixing fallback whitespaces count to fixed word/char split based on s…
davidsbatista Jan 10, 2025
38fce46
cleaning
davidsbatista Jan 10, 2025
c5d8b2f
cleaning
davidsbatista Jan 10, 2025
89b7ad1
Merge branch 'main' into add-recursive-chunking
davidsbatista Jan 10, 2025
2 changes: 1 addition & 1 deletion haystack/components/preprocessors/document_splitter.py
@@ -14,7 +14,7 @@
logger = logging.getLogger(__name__)

# Maps the 'split_by' argument to the actual char used to split the Documents.
-# 'function' is not in the mapping cause it doesn't split on chars.
+# 'function' is not in the mapping because it doesn't split on chars.
_SPLIT_BY_MAPPING = {"page": "\f", "passage": "\n\n", "sentence": ".", "word": " ", "line": "\n"}
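
The comment above describes a lookup from a `split_by` value to the character(s) used for splitting. A minimal sketch of that lookup, assuming a plain `str.split` is applied afterwards; `split_units` is a hypothetical helper for illustration and is not part of `document_splitter.py`:

```python
from typing import Dict, List

# Same mapping as in the diff above.
SPLIT_BY_MAPPING: Dict[str, str] = {"page": "\f", "passage": "\n\n", "sentence": ".", "word": " ", "line": "\n"}


def split_units(text: str, split_by: str) -> List[str]:
    """Hypothetical helper: split `text` on the character(s) mapped to `split_by`."""
    if split_by not in SPLIT_BY_MAPPING:
        # 'function' (and any unknown value) has no character to split on.
        raise ValueError(f"Unsupported split_by value: {split_by!r}")
    return text.split(SPLIT_BY_MAPPING[split_by])
```

A value such as `"function"` has no entry in the mapping, which is why the sketch rejects it instead of looking up a character.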


212 changes: 212 additions & 0 deletions haystack/components/preprocessors/recursive_chunker.py
@@ -0,0 +1,212 @@
# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0

import re
from typing import Any, Dict, List

from haystack import Document, component, default_from_dict, default_to_dict, logging

logger = logging.getLogger(__name__)


@component
class RecursiveChunker:
    """
    Recursively chunk text into smaller chunks.

    This component splits text into smaller chunks by recursively applying a list of separators to the text.

    Each separator is applied to the text in turn. The component then checks each of the resulting chunks: it keeps
    the chunks that are within the chunk_size, and for the ones that are larger than the chunk_size it applies the
    next separator in the list to the remaining text.

    This is done until no chunk is larger than the chunk_size parameter.

    Example:

    ```python
    from haystack import Document
    from haystack.components.preprocessors.recursive_chunker import RecursiveChunker

    chunker = RecursiveChunker(chunk_size=260, chunk_overlap=0, separators=["\n\n", "\n", ".", " "], keep_separator=True)
    text = '''Artificial intelligence (AI) - Introduction

    AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
    AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.'''

    doc = Document(content=text)
    doc_chunks = chunker.run([doc])
    print(doc_chunks["documents"])
    >[
    >Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '65167a9823dd883de577e828ca4fd529e6f7241f0ff616acfce454d808478951'}),
    >Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems. ', meta: {'original_id': '65167a9823dd883de577e828ca4fd529e6f7241f0ff616acfce454d808478951'}),
    >Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '65167a9823dd883de577e828ca4fd529e6f7241f0ff616acfce454d808478951'}),
    >Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '65167a9823dd883de577e828ca4fd529e6f7241f0ff616acfce454d808478951'})
    >]
    ```
    """  # noqa: E501

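    def __init__(  # pylint: disable=too-many-positional-arguments
        self,
        chunk_size: int,
        chunk_overlap: int,
        separators: List[str],
        keep_separator: bool = True,
        is_separator_regex: bool = False,
    ):
        """
        Initializes a RecursiveChunker instance.

        :param chunk_size: Maximum length of each chunk, in characters.
        :param chunk_overlap: Number of characters to overlap between consecutive chunks.
        :param separators: Separators to try, in order. The special value "sentence" splits the text with the
            NLTK-based sentence tokenizer.
        :param keep_separator: Whether to keep the separator in the resulting chunks.
        :param is_separator_regex: Whether the separators are regular expressions.
        :raises ValueError: If the overlap is negative, or greater than or equal to the chunk size.
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.separators = separators
        self.keep_separator = keep_separator
        self.is_separator_regex = is_separator_regex
        self._check_params()
        if "sentence" in separators:
            self.nltk_tokenizer = self._get_custom_sentence_tokenizer()

    def _check_params(self):
        # an overlap of zero is valid, only negative values are rejected
        if self.chunk_overlap < 0:
            raise ValueError("Overlap must be greater than or equal to zero.")
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("Overlap cannot be greater than or equal to the chunk size.")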

    @staticmethod
    def _get_custom_sentence_tokenizer():
        try:
            from haystack.components.preprocessors.sentence_tokenizer import SentenceSplitter
        except (LookupError, ModuleNotFoundError) as error:
            # chain the original error so the root cause stays visible in the traceback
            raise Exception(
                "You need to install NLTK to use this function. You can install it via `pip install nltk`"
            ) from error
        return SentenceSplitter(language="en")

    def _apply_overlap(self, chunks: List[str]) -> List[str]:
        """
        Applies an overlap between consecutive chunks if the chunk_overlap attribute is greater than zero.

        :param chunks: List of text chunks.
        :returns:
            The list of chunks with overlap applied.
        """
        overlapped_chunks = []
        for idx, chunk in enumerate(chunks):
            if idx == 0:
                overlapped_chunks.append(chunk)
                continue
            overlap_start = max(0, len(chunks[idx - 1]) - self.chunk_overlap)
            current_chunk = chunks[idx - 1][overlap_start:] + chunk
            overlapped_chunks.append(current_chunk)
        return overlapped_chunks

    def _chunk_text(self, text: str) -> List[str]:
        """
        Recursive chunking algorithm that divides text into smaller chunks based on a list of separator characters.

        It starts with a list of separator characters (e.g., ["\n\n", "\n", ".", " "]) and attempts to divide the text
        using the first separator. If the resulting chunks are still larger than the specified chunk size, it moves to
        the next separator in the list. This process continues recursively, progressively applying each specific
        separator until the chunks meet the desired size criteria.

        :param text: The text to be split into chunks.
        :returns:
            A list of text chunks.
        """
        if len(text) <= self.chunk_size:
            return [text]

        # try each separator
        for separator in self.separators:
            # compare with ==: `separator in "sentence"` is a substring test and also matches "", "s", "e", ...
            if separator == "sentence":  # using nltk sentence tokenizer
                sentence_with_spans = self.nltk_tokenizer.split_sentences(text)
                splits = [sentence["sentence"] for sentence in sentence_with_spans]
            else:
                # split using the current separator
                splits = text.split(separator) if not self.is_separator_regex else re.split(separator, text)

            # filter out empty splits
            splits = [s for s in splits if s.strip()]

            if len(splits) == 1:  # go to next separator, if current separator not found
                continue

            chunks = []
            current_chunk: List[str] = []
            current_length = 0

            # check splits, if any is too long, recursively chunk it, otherwise add to current chunk
            for split in splits:
                split_text = split
                # NOTE: with is_separator_regex=True this appends the pattern itself, not the matched text
                if self.keep_separator and separator != "sentence":
                    split_text = split + separator

                # if adding this split exceeds chunk_size, process current_chunk
                if current_length + len(split_text) > self.chunk_size:
                    if current_chunk:  # keep the good splits
                        chunks.append("".join(current_chunk))
                        current_chunk = []
                        current_length = 0

                    # recursively handle splits that are too large
                    if len(split_text) > self.chunk_size:
                        chunks.extend(self._chunk_text(split_text))
                    else:
                        chunks.append(split_text)
                else:
                    current_chunk.append(split_text)
                    current_length += len(split_text)

            if current_chunk:
                chunks.append("".join(current_chunk))

            if self.chunk_overlap > 0:
                chunks = self._apply_overlap(chunks)

            return chunks

        # if no separator worked, fall back to character-level chunking
        return [text[i : i + self.chunk_size] for i in range(0, len(text), self.chunk_size - self.chunk_overlap)]

    def to_dict(self) -> Dict[str, Any]:
        """
        Serializes the RecursiveChunker instance to a dictionary.
        """
        return default_to_dict(
            self,
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=self.separators,
            keep_separator=self.keep_separator,
            is_separator_regex=self.is_separator_regex,
        )

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "RecursiveChunker":
        """
        Deserializes a dictionary to a RecursiveChunker instance.
        """
        return default_from_dict(cls, data)

    def _run_one(self, doc: Document) -> List[Document]:
        new_docs = []
        # NOTE: the check for a non-empty content is already done in the run method, hence the type ignore below
        chunks = self._chunk_text(doc.content)  # type: ignore
        for chunk in chunks:
            # copy the metadata so the chunks do not share (and mutate) the original document's meta dict
            new_doc = Document(content=chunk, meta={**doc.meta, "original_id": doc.id})
            new_docs.append(new_doc)
        return new_docs

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, List[Document]]:
        """
        Split documents into Documents with smaller chunks of text.

        :param documents: List of Documents to split.
        :returns:
            A dictionary containing a key "documents" with a List of Documents with smaller chunks of text
            corresponding to the input documents.
        """
        new_docs = []
        for doc in documents:
            # `not doc.content` covers both None and the empty string
            if not doc.content:
                logger.warning("Document ID {doc_id} has an empty content. Skipping this document.", doc_id=doc.id)
                continue
            new_docs.extend(self._run_one(doc))
        return {"documents": new_docs}
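
The overlap handling in `_check_params`, `_apply_overlap` and the character-level fallback can be tried in isolation. The following is a minimal standalone sketch of those three pieces, assuming lengths are counted in characters; the free functions `validate_overlap`, `apply_overlap` and `sliding_windows` are hypothetical names for illustration and are not part of the component:

```python
from typing import List


def validate_overlap(chunk_size: int, chunk_overlap: int) -> None:
    """Mirror the two parameter checks: 0 <= chunk_overlap < chunk_size."""
    if chunk_overlap < 0:
        raise ValueError("Overlap must be greater than or equal to zero.")
    if chunk_overlap >= chunk_size:
        raise ValueError("Overlap cannot be greater than or equal to the chunk size.")


def apply_overlap(chunks: List[str], chunk_overlap: int) -> List[str]:
    """Prefix every chunk except the first with the tail of its predecessor."""
    if chunk_overlap <= 0:
        return list(chunks)
    result = chunks[:1]
    for previous, chunk in zip(chunks, chunks[1:]):
        # A negative slice start longer than the string yields the whole string.
        result.append(previous[-chunk_overlap:] + chunk)
    return result


def sliding_windows(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Character-level fallback: fixed-size windows that advance by size minus overlap."""
    validate_overlap(chunk_size, chunk_overlap)
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


print(apply_overlap(["abcd", "efgh", "ij"], 2))  # ['abcd', 'cdefgh', 'ghij']
print(sliding_windows("abcdefgh", 4, 2))  # ['abcd', 'cdef', 'efgh', 'gh']
```

Two properties are visible in the printed results: a chunk with overlap applied can be longer than the chunk size, and the last fallback window (`'gh'`) can be fully contained in the previous one.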
@@ -0,0 +1,4 @@
---
features:
  - |
    Adding a `RecursiveChunker`, which uses a set of separators to split text recursively. It attempts to divide the text using the first separator; if the resulting chunks are still larger than the specified size, it moves to the next separator in the list.
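
The separator fallback described in this release note can be illustrated without Haystack. The sketch below is a simplified, hypothetical `recursive_split` function, not the component's implementation: it assumes non-empty string separators, counts length in characters, applies no overlap, and re-attaches each separator so that the chunks concatenate back to the input.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` characters (chunk_size >= 1)."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: fall back to fixed-size character windows.
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    pieces = text.split(separator)
    if len(pieces) == 1:
        # Separator not present in the text: try the next one.
        return recursive_split(text, remaining, chunk_size)

    # Re-attach the separator to every piece except the last one.
    pieces = [piece + separator for piece in pieces[:-1]] + [pieces[-1]]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if len(current) + len(piece) <= chunk_size:
            current += piece  # still fits: keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) > chunk_size:
            # Too large on its own: recurse with the remaining separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Para one.\n\nPara two is a bit longer. It has two sentences.\n\nEnd"
for chunk in recursive_split(sample, ["\n\n", ".", " "], 20):
    print(repr(chunk))
```

Because every piece keeps its trailing separator, joining the returned chunks reproduces the original text, which makes the sketch easy to check against any input.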