Machine Specs
Device: MacBook Pro
CPU: Apple M3 Pro
Memory: 18GB
OS: macOS Sonoma
Problem
I'm working with one of your 80MB models. The embeddings are great, but inference could be faster for my use case, so I want to quantise the model to 8 bits to speed it up. I've tried to do that with this code:
import torch
from sentence_transformers import SentenceTransformer

torch.backends.quantized.engine = "qnnpack"
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # layers to quantize
    dtype=torch.qint8    # quantization data type
)
To my surprise, this halved the model's performance!
I've searched your docs, but I can't find anything on the best way to quantise your models. Is there a standard approach I should be following?
Are you referring to the throughput, i.e. inference speed? Or the evaluation performance on a benchmark of yours?
Something to note is that while int8 is commonly used for LLMs, it's primarily used to shrink memory usage (at least, to my knowledge). Beyond that, I'm not very familiar with torch's quantize_dynamic quantization code.
Another thing to consider is that a GPU might have solid int8 operations, but a CPU might not. In other words, the same quantized model might be faster on a GPU but slower on a CPU, and I suspect that's the main difference you're seeing.
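If you want to check that on your machine, here's a minimal timing sketch; the corpus size and batch size are just placeholders, and the qnnpack engine line is an assumption for Apple Silicon. It times model.encode before and after dynamic quantization:

import time
import torch
from sentence_transformers import SentenceTransformer

torch.backends.quantized.engine = "qnnpack"  # assumption: ARM backend on Apple Silicon
sentences = ["An example sentence to embed."] * 1000  # hypothetical workload

model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

for name, m in [("fp32", model), ("int8", quantized_model)]:
    start = time.perf_counter()
    m.encode(sentences, batch_size=32, show_progress_bar=False)
    print(f"{name}: {time.perf_counter() - start:.2f}s")

That should tell you whether the slowdown comes from the int8 matmuls themselves rather than anything sentence-transformers is doing.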
The upcoming release will introduce some more options for speeding up your models:
[feat] Add lightning-fast StaticEmbedding module based on model2vec #2961 will add support for model2vec, i.e. a library for converting a model into a set of vectors. So, rather than "running inference", you'll just grab token embeddings from an EmbeddingBag and calculate the mean. In my tests, it was about 300x faster on CPU.
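To give a rough picture of the idea (this is not the actual StaticEmbedding API from the PR, just a sketch of the lookup-and-mean mechanics, with random weights standing in for the distilled token vectors):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
dim = 384  # embedding size of all-MiniLM-L6-v2

# In practice these weights would come from model2vec-style distillation;
# random weights here only illustrate the mechanics.
bag = torch.nn.EmbeddingBag(tokenizer.vocab_size, dim, mode="mean")

ids = tokenizer("This is a sentence", add_special_tokens=False, return_tensors="pt")["input_ids"]
sentence_embedding = bag(ids)  # a lookup plus a mean, no transformer forward pass
print(sentence_embedding.shape)  # torch.Size([1, 384])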
In the meantime, you can experiment with those PRs if you're interested (you can install them directly with pip install git+https://github.com/PR_USER/sentence-transformers.git@NAME_OF_BRANCH), or you can use float16 (model.half()) or bfloat16 (model.bfloat16()), but those only help on GPUs.
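For completeness, the half-precision route is just the following; this assumes a CUDA GPU, and I haven't checked how well it behaves on MPS/Apple Silicon:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
model.half()  # or model.bfloat16() on hardware with bfloat16 support
embeddings = model.encode(["An example sentence."])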