BGE-M3 custom embeddings always have the same number of chunks between Semantic and SDPM Chunker #94

Closed
armsp opened this issue Dec 16, 2024 · 6 comments

armsp commented Dec 16, 2024

The general idea is that, because of merging, the SDPM Chunker should produce fewer or the same number of chunks as the Semantic Chunker. This is clearly visible when using standard Sentence Transformer models (mpnet, MiniLM, etc.).

However, I made a CustomEmbeddings class for some recent models, and for BGE-M3 specifically I see that no matter what I do, the chunks from SDPM and Semantic remain the same. I tried printing the similarities, embeddings, etc., and I do see differences, but for some reason I never get different chunks, even though I believe SDPM should merge some of them.

Setup: install the FlagEmbedding package, which provides BGEM3FlagModel: pip install -U FlagEmbedding

Custom embedding class (please don't mind the quick-and-dirty implementation; I had to test fast):

import numpy as np
from typing import List

from FlagEmbedding import BGEM3FlagModel
from chonkie.embeddings import BaseEmbeddings


class CustomEmbeddings(BaseEmbeddings):
    def __init__(self):
        self.model = BGEM3FlagModel("./bge-m3", use_fp16=True)
        self.task = "separation"

    @property
    def dimension(self) -> int:
        return 1024

    def embed(self, text: str) -> "np.ndarray":
        e = self.model.encode(
            [text], return_dense=True, return_sparse=False, return_colbert_vecs=False
        )["dense_vecs"][0]
        # print(e)
        return e

    def embed_batch(self, texts: List[str]) -> List["np.ndarray"]:
        embeddings = self.model.encode(
            texts, return_dense=True, return_sparse=False, return_colbert_vecs=False
        )
        # print(embeddings["dense_vecs"])
        return embeddings["dense_vecs"]

    def count_tokens(self, text: str) -> int:
        l = len(self.model.tokenizer.encode(text))
        # print(l)
        return l

    def count_tokens_batch(self, texts: List[str]) -> List[int]:
        encodings = self.model.tokenizer(texts)
        # print([len(enc) for enc in encodings["input_ids"]])
        return [len(enc) for enc in encodings["input_ids"]]

    def get_tokenizer_or_token_counter(self):
        return self.model.tokenizer

    def similarity(self, u: "np.ndarray", v: "np.ndarray") -> float:
        """Compute cosine similarity between two embeddings."""
        s = (u @ v.T)  # .item()
        # print(s)
        return s

    @classmethod
    def is_available(cls) -> bool:
        return True

    def __repr__(self) -> str:
        return "bgem3"

Code: you can use the Paul Graham essay as the input text for chunking -> https://gist.githubusercontent.com/wey-gu/75d49362d011a0f0354d39e396404ba2/raw/0844351171751ebb1ce54ea62232bf5e59445bb7/paul_graham_essay.txt

from chonkie import SemanticChunker
from chonkie import SDPMChunker
from typing import List
import numpy as np
from FlagEmbedding import BGEM3FlagModel
from chonkie.embeddings import BaseEmbeddings

# New custom embedding code...
embeddings = CustomEmbeddings()

with open('./pg_essay.txt', 'r') as file:
    text = file.read()

chunker = SemanticChunker(
    embedding_model=embeddings,
    threshold=0.75,
    chunk_size=1536
)

chunks = chunker.chunk(text)
print(f"Number of chunks: {len(chunks)}")
# for chunk in chunks:
#     print(f"Chunk text: {chunk.text}")
#     print(f"Token count: {chunk.token_count}")
#     print(f"Number of sentences: {len(chunk.sentences)}")

chunker = SDPMChunker(
    embedding_model=embeddings,
    threshold=0.75,
    chunk_size=1536
)

chunks = chunker.chunk(text)
print("\n~~~~~~~~~  SDPM ~~~~~~~~~~~~~")
print(f"Number of chunks: {len(chunks)}")

No matter what I use for chunk_size and threshold, the number of chunks is the same for both chunkers.

For example, using mpnet with the parameters above we get 384 and 372 chunks (as expected), but for BGE-M3 we get 92 from each chunker.

armsp added the bug label on Dec 16, 2024
shreyashnigam added the in progress label on Dec 16, 2024
bhavnicksm (Collaborator) commented

Hey @armsp!

Thanks for opening an issue and for the detailed instructions for reproduction~! 😊

I'm looking into this now; I'll follow your instructions and try to reproduce what you're seeing, and I'll get back to you once I can recreate it.

Thanks! 😊

bhavnicksm (Collaborator) commented

Hey @armsp!

I think the issue lies in your similarity function not being normalized. When I changed it to a proper cosine similarity, the two chunkers produced different chunk counts.

The easiest fix is to remove the similarity implementation from your class, so the default from BaseEmbeddings is used instead.

Please re-open this issue if this fix doesn't help~

Thanks 😊
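
For readers wondering why the normalization matters here: the chunkers compare similarity scores against the threshold (0.75 in the snippet above), and a raw dot product scales with the vector norms instead of staying in the [-1, 1] range a cosine threshold assumes. A tiny standalone illustration with made-up vectors (not the library's code):

import numpy as np

u = np.array([3.0, 4.0])  # norm 5
v = np.array([6.0, 8.0])  # norm 10, same direction as u

print(u @ v.T)  # 50.0 -- dwarfs a 0.75 threshold even though the directions are identical
print(float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))))  # 1.0 -- proper cosine

If every score lands on the same side of the threshold, both chunkers end up making the same grouping decisions, which would explain the identical chunk counts.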

armsp (Author) commented Jan 6, 2025

I see... do you have the snippet of the change you made that gave you different results? Just the normalization part would be more than enough.

bhavnicksm (Collaborator) commented Jan 6, 2025

Hey @armsp,

If you're implementing it, then something like this should work!

float(np.dot(u, v.T) / (np.linalg.norm(u) * np.linalg.norm(v)))  # cosine similarity

This is what BaseEmbeddings uses for the similarity fn by default, so if you don't override it, you get this behavior anyway.

Let me know if you see the change with this.

Thanks!
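
Dropped into the CustomEmbeddings class from the original report, the corrected method would look something like this (a sketch based on the snippet above):

def similarity(self, u: "np.ndarray", v: "np.ndarray") -> float:
    """Compute cosine similarity between two embeddings."""
    return float(np.dot(u, v.T) / (np.linalg.norm(u) * np.linalg.norm(v)))

Or simply delete the override and inherit the default from BaseEmbeddings, as suggested earlier in the thread.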

armsp (Author) commented Jan 6, 2025

That seems to work :)

bhavnicksm (Collaborator) commented

Awesome! 😎🚀
