BGE-M3 custom embeddings always have the same number of chunks between Semantic and SDPM Chunker #94

Closed
armsp opened this issue Dec 16, 2024 · 6 comments

armsp commented Dec 16, 2024

The general idea is that, because of merging, the SDPM Chunker should produce fewer or the same number of chunks as the Semantic Chunker. This is clearly visible when using standard Sentence Transformer models (mpnet, MiniLM, etc.).

However, I made a CustomEmbeddings class for some recent models, and for BGE-M3 specifically I see that no matter what I do, the chunks from SDPM and Semantic remain the same. I tried printing the similarities, embeddings, etc., and I do see differences, but for some reason I never get different chunks, even though I believe SDPM should merge some of them.

Setup: install the FlagEmbedding package, which provides BGEM3FlagModel: pip install -U FlagEmbedding

Custom embedding class (please don't mind the quick-and-dirty implementation; I had to test fast):

import numpy as np
from typing import List

from FlagEmbedding import BGEM3FlagModel
from chonkie.embeddings import BaseEmbeddings


class CustomEmbeddings(BaseEmbeddings):
    def __init__(self):
        self.model = BGEM3FlagModel("./bge-m3", use_fp16=True)
        self.task = "separation"

    @property
    def dimension(self) -> int:
        return 1024

    def embed(self, text: str) -> "np.ndarray":
        e = self.model.encode(
            [text], return_dense=True, return_sparse=False, return_colbert_vecs=False
        )["dense_vecs"][0]
        # print(e)
        return e

    def embed_batch(self, texts: List[str]) -> List["np.ndarray"]:
        embeddings = self.model.encode(
            texts, return_dense=True, return_sparse=False, return_colbert_vecs=False
        )
        # print(embeddings["dense_vecs"])
        return embeddings["dense_vecs"]

    def count_tokens(self, text: str) -> int:
        l = len(self.model.tokenizer.encode(text))
        # print(l)
        return l

    def count_tokens_batch(self, texts: List[str]) -> List[int]:
        encodings = self.model.tokenizer(texts)
        # print([len(enc) for enc in encodings["input_ids"]])
        return [len(enc) for enc in encodings["input_ids"]]

    def get_tokenizer_or_token_counter(self):
        return self.model.tokenizer

    def similarity(self, u: "np.ndarray", v: "np.ndarray") -> float:
        """Compute cosine similarity between two embeddings."""
        s = (u @ v.T)  # .item()
        # print(s)
        return s

    @classmethod
    def is_available(cls) -> bool:
        return True

    def __repr__(self) -> str:
        return "bgem3"

Code: you can use the Paul Graham essay as the input text for chunking -> https://gist.githubusercontent.com/wey-gu/75d49362d011a0f0354d39e396404ba2/raw/0844351171751ebb1ce54ea62232bf5e59445bb7/paul_graham_essay.txt

from chonkie import SemanticChunker
from chonkie import SDPMChunker
from typing import List
import numpy as np
from FlagEmbedding import BGEM3FlagModel
from chonkie.embeddings import BaseEmbeddings

# New custom embedding code...
embeddings = CustomEmbeddings()

with open('./pg_essay.txt', 'r') as file:
    text = file.read()

chunker = SemanticChunker(
    embedding_model=embeddings,
    threshold=0.75,
    chunk_size=1536
)

chunks = chunker.chunk(text)
print(f"Number of chunks: {len(chunks)}")
# for chunk in chunks:
#     print(f"Chunk text: {chunk.text}")
#     print(f"Token count: {chunk.token_count}")
#     print(f"Number of sentences: {len(chunk.sentences)}")

chunker = SDPMChunker(
    embedding_model=embeddings,
    threshold=0.75,
    chunk_size=1536
)

chunks = chunker.chunk(text)
print("\n~~~~~~~~~  SDPM ~~~~~~~~~~~~~")
print(f"Number of chunks: {len(chunks)}")

No matter what I use for chunk_size and threshold, the number of chunks is the same for both chunkers.

For example, using mpnet with the parameters above we get 384 and 372 chunks (as expected), but for BGE-M3 we get 92 from each chunker.

armsp added the bug label on Dec 16, 2024
shreyashnigam added the in progress label on Dec 16, 2024
bhavnicksm (Collaborator) commented

Hey @armsp!

Thanks for opening an issue and for the detailed instructions for reproduction~! 😊

I'm looking into this now; I'll follow your instructions and try to reproduce what you're seeing, and I'll get back to you once I can recreate it.

Thanks! 😊

bhavnicksm (Collaborator) commented

Hey @armsp!

I think the issue lies in your similarity function not being normalized. When I changed it to a proper cosine similarity, the two chunkers produced different chunk counts.

The easiest fix is to remove the similarity implementation from your class, so the default from BaseEmbeddings is used instead.

Please re-open this issue if this fix doesn't help~

Thanks 😊
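
For readers wondering why the normalization matters here: the chunkers compare similarity scores against the threshold (0.75 in the snippet above), and a raw dot product scales with the vector norms instead of staying in the [-1, 1] range a cosine threshold assumes. A tiny standalone illustration with made-up vectors (not the library's code):

import numpy as np

u = np.array([3.0, 4.0])  # norm 5
v = np.array([6.0, 8.0])  # norm 10, same direction as u

print(u @ v.T)  # 50.0 -- dwarfs a 0.75 threshold even though the directions are identical
print(float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))))  # 1.0 -- proper cosine

If every score lands on the same side of the threshold, both chunkers end up making the same grouping decisions, which would explain the identical chunk counts.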

armsp (Author) commented Jan 6, 2025

I see... do you have the snippet of the change you made that gave you different results? Just the normalization part would be more than enough.

bhavnicksm (Collaborator) commented Jan 6, 2025

Hey @armsp,

If you're implementing it, then something like this should work!

float(np.dot(u, v.T) / (np.linalg.norm(u) * np.linalg.norm(v)))  # cosine similarity

This is what BaseEmbeddings uses for the similarity fn by default, so if you don't override it, you get this behavior anyway.

Let me know if you see the change with this.

Thanks!
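
Dropped into the CustomEmbeddings class from the original report, the corrected method would look something like this (a sketch based on the snippet above):

def similarity(self, u: "np.ndarray", v: "np.ndarray") -> float:
    """Compute cosine similarity between two embeddings."""
    return float(np.dot(u, v.T) / (np.linalg.norm(u) * np.linalg.norm(v)))

Or simply delete the override and inherit the default from BaseEmbeddings, as suggested earlier in the thread.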

armsp (Author) commented Jan 6, 2025

That seems to work :)

bhavnicksm (Collaborator) commented

Awesome! 😎🚀
