[BUG] embedding_model is not a valid embedding model', 'Please install the semantic extra to use this feature' #104

Closed
universewill opened this issue Dec 24, 2024 · 4 comments
Labels: bug, in progress, question

Comments

@universewill

universewill commented Dec 24, 2024

I installed the semantic extra with "pip install chonkie[semantic]" and still get this error:

from chonkie import TokenChunker, SemanticChunker, SentenceChunker

chunker = SemanticChunker(
     embedding_model="all-minilm-l6-v2",
     chunk_size=512,
     similarity_threshold=0.7
)

chunks = chunker("Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe. The pandas is a creature. It has white and black fur. Its' very cute")

for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")
bhavnicksm changed the title from "embedding_model is not a valid embedding model', 'Please install the semantic extra to use this feature'" to "[BUG] embedding_model is not a valid embedding model', 'Please install the semantic extra to use this feature'" on Dec 24, 2024
bhavnicksm self-assigned this on Dec 24, 2024
bhavnicksm added the bug and question labels on Dec 24, 2024
@bhavnicksm
Collaborator

bhavnicksm commented Dec 24, 2024

Hey @universewill!

Thanks for opening an issue~ 😊

The error occurs because, by default, the semantic module assumes you are going to use Model2VecEmbeddings (Chonkie uses model2vec because it is fast and lightweight).

When you pass in all-minilm-l6-v2, it requires the sentence-transformers dependency, which can be installed with pip install "chonkie[st]". Here's an image from the DOCS Installation page that describes this behaviour:

[image: screenshot from the DOCS Installation page]

The error message cannot tell which dependency is missing for the particular model you are trying, so it falls back to the generic "Please install the semantic extra" message.

Let me know if you're able to use it with chonkie[st]~

Thanks! ☺️
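
To check which embedding backend is actually importable in your environment before constructing the chunker, something like the sketch below may help. It uses only the standard library. The extras named here come from this thread; the module names (model2vec, sentence_transformers) and the pairing of each extra with a module are assumptions based on the packages' usual import names, and missing_backends is a hypothetical helper, not part of Chonkie.

```python
from importlib.util import find_spec

# Assumed mapping: backend -> (module it imports, pip extra that provides it).
# The extras are the ones mentioned in this thread; the pairing is an assumption.
BACKENDS = {
    "model2vec": ("model2vec", "chonkie[semantic]"),
    "sentence-transformers": ("sentence_transformers", "chonkie[st]"),
}


def missing_backends(backends=BACKENDS):
    """Return {backend: pip extra} for every backend whose module is not importable."""
    missing = {}
    for name, (module, extra) in backends.items():
        # find_spec returns None when a top-level module cannot be found.
        if find_spec(module) is None:
            missing[name] = extra
    return missing


if __name__ == "__main__":
    for name, extra in missing_backends().items():
        print(f'{name} is not installed; try: pip install "{extra}"')
```

If the script prints nothing, both backends are importable and the error is likely coming from somewhere else.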

@bhavnicksm
Collaborator

from chonkie import TokenChunker, SemanticChunker, SentenceChunker

chunker = SemanticChunker(
     embedding_model="all-minilm-l6-v2",
     chunk_size=512,
     similarity_threshold=0.7
)

Also, just wanted to point out that similarity_threshold is no longer used; we are using threshold instead. Please refer to the DOCS for the API reference.
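
For existing code that still passes the old keyword, a small sketch of the rename could look like this. RENAMED_KWARGS, update_kwargs and build_chunker are hypothetical helpers written for illustration, not Chonkie functions; the only rename assumed is the one mentioned above, and build_chunker needs chonkie plus the embedding backend for the chosen model to be installed.

```python
# Old keyword -> new keyword. Only this one rename is mentioned in the thread.
RENAMED_KWARGS = {"similarity_threshold": "threshold"}


def update_kwargs(kwargs):
    """Return a copy of kwargs with legacy keyword names replaced."""
    return {RENAMED_KWARGS.get(key, key): value for key, value in kwargs.items()}


def build_chunker(**kwargs):
    """Create a SemanticChunker from possibly outdated keyword arguments."""
    # Imported lazily so update_kwargs can be used without chonkie installed.
    from chonkie import SemanticChunker

    return SemanticChunker(**update_kwargs(kwargs))


# The arguments from the original report, with the keyword renamed.
settings = update_kwargs(
    {"embedding_model": "all-minilm-l6-v2", "chunk_size": 512, "similarity_threshold": 0.7}
)
print(settings)
```

In new code it is simpler to pass threshold=0.7 directly; the helper only makes sense while migrating older call sites.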

@bhavnicksm
Collaborator

Update: If you pass an unknown keyword argument to the init, SentenceTransformerEmbeddings fails. This is problematic.

I'll look into this and add a patch for this, asap.

bhavnicksm added the in progress label on Dec 24, 2024
@bhavnicksm
Collaborator

Added a patch to give a better error message when an unknown kwarg causes a failure; closing issue~
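
For readers curious what this kind of check can look like in general, here is a minimal sketch that rejects unknown keyword arguments with an explicit message. It is not the actual patch: check_kwargs, DemoChunker and make_demo_chunker are hypothetical names used only for illustration, and DemoChunker's defaults are made up.

```python
import inspect


def check_kwargs(func, kwargs):
    """Raise a TypeError naming any keyword that func does not accept."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter means func accepts arbitrary keywords.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return
    unknown = sorted(set(kwargs) - set(params))
    if unknown:
        raise TypeError(
            f"{func.__qualname__} got unexpected keyword argument(s) {unknown}; "
            f"accepted: {sorted(params)}"
        )


class DemoChunker:
    """Hypothetical stand-in for a chunker; not a Chonkie class."""

    def __init__(self, embedding_model, chunk_size=512, threshold=0.5):
        self.embedding_model = embedding_model
        self.chunk_size = chunk_size
        self.threshold = threshold


def make_demo_chunker(**kwargs):
    """Validate the keywords first so the error names the offending argument."""
    check_kwargs(DemoChunker, kwargs)
    return DemoChunker(**kwargs)
```

With this in place, passing similarity_threshold=0.7 to make_demo_chunker raises a TypeError that names the unknown keyword and lists the accepted ones, instead of a misleading message about a missing dependency.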
