Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Enhance SemanticChunker with error handling and similarity threshold updates

- Added error handling for missing embedding model, prompting installation of the `semantic` extra.
- Updated similarity threshold assignment to use the instance variable consistently.
- Introduced a new test for SDPMChunker to validate functionality with percentile-based similarity, ensuring proper chunking behavior and attributes.
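The first item in the commit message (a clear error that points at the `semantic` extra when the embedding dependency is missing) can be illustrated with a minimal sketch. The helper name `require_optional`, its parameters, and the exact install command are assumptions made for illustration; the changed chunker file is not shown on this page, so this is not the code the commit adds.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str, package: str = "chonkie") -> ModuleType:
    """Import an optional dependency or raise an ImportError naming the extra.

    Hypothetical helper: the name and signature are illustrative only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Assumed install hint; the real message in the commit may differ.
        raise ImportError(
            f"Could not import '{module_name}'. Install the optional extra with: "
            f'pip install "{package}[{extra}]"'
        ) from exc
```

Calling `require_optional("json", "semantic")` returns the standard-library `json` module, while a module that is not installed raises an `ImportError` whose message mentions `chonkie[semantic]`.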
1 parent fb37573, commit 719e33b
Showing 3 changed files with 47 additions and 3 deletions.
@@ -0,0 +1,21 @@
"""Factory class for creating and managing tokenizers.

This factory class is used to create and manage tokenizers for the Chonkie
package. It provides a simple interface for initializing, encoding, decoding,
and counting tokens using different tokenizer backends.

This is used in the Chunker and Refinery classes to ensure consistent tokenization
across different parts of the pipeline.
"""

from typing import Callable, List, TYPE_CHECKING


if TYPE_CHECKING:
    import tiktoken
    from transformers import AutoTokenizer
    from tokenizers import Tokenizer

class TokenFactory:
    """Factory class for creating and managing tokenizers."""
    pass
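The class in this diff is only a stub (`pass`), so the interface described in the module docstring (initializing, encoding, decoding, and counting tokens over interchangeable backends) is not yet visible. The sketch below shows one way such an interface could look. Everything in it is an assumption: the class name `SimpleTokenFactory`, the method names, and the two toy backends (`"whitespace"` and `"character"`) are hypothetical stand-ins chosen so the example runs without `tiktoken`, `transformers`, or `tokenizers` installed.

```python
from typing import Callable, Dict, List


class SimpleTokenFactory:
    """Hypothetical illustration of a tokenizer factory; not Chonkie's implementation."""

    def __init__(self, backend: str = "whitespace") -> None:
        # Toy backends standing in for real tokenizer libraries.
        backends: Dict[str, Callable[[str], List[str]]] = {
            "whitespace": str.split,  # split on runs of whitespace
            "character": list,        # one token per character
        }
        if backend not in backends:
            raise ValueError(f"Unknown tokenizer backend: {backend!r}")
        self.backend = backend
        self._tokenize = backends[backend]
        # Separator used to rebuild text from tokens for this backend.
        self._joiner = " " if backend == "whitespace" else ""

    def encode(self, text: str) -> List[str]:
        """Split text into tokens using the selected backend."""
        return self._tokenize(text)

    def decode(self, tokens: List[str]) -> str:
        """Join tokens back into text."""
        return self._joiner.join(tokens)

    def count_tokens(self, text: str) -> int:
        """Return the number of tokens the backend produces for text."""
        return len(self.encode(text))
```

Routing every caller through a single object like this is what would let separate pipeline stages agree on token counts, which is the consistency goal the docstring states for the Chunker and Refinery classes.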