
[Bug]: Parallel Embedding is not working on Windows Servers #414

Open
abdelkareemkobo opened this issue Nov 25, 2024 · 3 comments

@abdelkareemkobo
What happened?

I am trying to encode my dataset with multiple CUDA GPUs, but only one GPU is being used.

What is the expected behaviour?

All four specified GPUs should be used.

A minimal reproducible example

embedding_model = LateInteractionTextEmbedding("jinaai/jina-colbert-v2", cuda=True, device_ids=[0, 1, 2, 3])

descriptions_embeddings = list(embedding_model.embed(documents, parallel=4))

What Python version are you on? e.g. python --version

Python 3.11

FastEmbed version

v0.4.2

What os are you seeing the problem on?

No response

Relevant stack traces and/or logs

No response

@joein
Member

joein commented Dec 4, 2024

Hi @abdelkareemkobo,

parallel=4 does not spread the data across all available GPUs by default; you need to initialize your model with the cuda and device_ids params, like:

LateInteractionTextEmbedding(
    model_name=model_name,    # e.g. "jinaai/jina-colbert-v2"
    cuda=args.use_cuda,       # True to run on GPU
    device_ids=device_ids,    # e.g. [0, 1, 2, 3]
    lazy_load=lazy_load,      # True, so each worker process loads its own copy
)
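
For completeness, a minimal end-to-end sketch of the intended multi-GPU usage, assuming a machine with four CUDA GPUs; the document list is illustrative:

from fastembed import LateInteractionTextEmbedding

# Illustrative workload; any list of strings works.
documents = ["some text to embed"] * 1000

# lazy_load=True defers model loading, so each worker process
# spawned by parallel= loads the model onto its own GPU.
model = LateInteractionTextEmbedding(
    model_name="jinaai/jina-colbert-v2",
    cuda=True,
    device_ids=[0, 1, 2, 3],
    lazy_load=True,
)

# parallel should match len(device_ids), so one worker is spawned per GPU.
embeddings = list(model.embed(documents, parallel=4))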

@Abdullahaml1
Thanks @joein, but I'm running into the same issue on Ubuntu. I'm not able to select the CUDA devices used for indexing, and it is not clear from the docs how to run indexing on multiple GPUs on the same machine.

Here is a snippet to reproduce using Python 3.12:

import time
from dataclasses import dataclass
from typing import Any
import os

from qdrant_client import QdrantClient
from datasets import load_dataset
from fastembed import TextEmbedding


@dataclass
class CollectionItem:
    text: str
    metadata: dict[str, Any] | None = None

    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {'text': self.text}


@dataclass
class CollectionItemPool:
    items: list[CollectionItem]
    docs: list[str] | None = None
    metadata: list[dict] | None = None

    def __post_init__(self):
        if self.docs is None:
            self.docs = [i.text for i in self.items]

        if self.metadata is None:
            self.metadata = [i.metadata for i in self.items]


def prepare_dataset(limit: int | None = None) -> CollectionItemPool | None:
    en_ds = load_dataset("allenai/c4", "en", split='train', streaming=True)

    if limit is not None:
        assert isinstance(limit, int), (
            f'`limit` has to be integer got {type(limit)}')

        ds = en_ds
        # ds = en_ds.select(range(limit))

        items: list[CollectionItem] = []
        for idx, ds_item in enumerate(ds):
            if idx == limit:
                break

            item = CollectionItem(text=ds_item['text'])
            items.append(item)

        return CollectionItemPool(items=items)
    return None


if __name__ == '__main__':
    # setting cuda devices
    # os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

    # Initialize the client
    client = QdrantClient(":memory:")  # or QdrantClient(path="path/to/db")
    embedding_model_gpu = TextEmbedding(
        model_name="intfloat/multilingual-e5-large",
        providers=["CUDAExecutionProvider"],
        device_ids=[1, 2, 3],
        cuda=True,
        lazy_load=True
        # model_name="BAAI/bge-big-en-v1.5", providers=["CUDAExecutionProvider"]
    )
    print('Base Class')
    print(embedding_model_gpu.__class__.__bases__)
    # print(embedding_model_gpu.model.model.get_providers())
    print('Done loading embedding model on GPU')

    print('Loading Dataset')
    items_pool = prepare_dataset(limit=1024)
    print('Done loading dataset')

    start_idx_time = time.time()
    print('Start Indexing ..')
    # embed() returns a lazy generator; each embedding is a numpy array object
    embeds = embedding_model_gpu.embed(items_pool.docs, batch_size=256)
    end_idx_time = time.time()
    for embed in embeds:
        print(type(embed))
        print(embed.shape)
        # print(embed)  # numpy array
        break
    print(f'End Indexing in {end_idx_time - start_idx_time:.4f}s')

Here I set device_ids to [1, 2, 3], but fastembed is still running on device 0. If you increase the batch size, you will get:

Failed to allocate memory for requested buffer of size 17179869184

@hh-space-invader
Contributor

hh-space-invader commented Dec 17, 2024

@Abdullahaml1 Please ensure that the parallel argument in .embed() equals len(device_ids); in your example that is 3.
The reason is that parallel enables multi-GPU support by spawning a child process for each GPU specified in device_ids. To ensure proper utilization, the value of parallel must match the number of GPUs in device_ids. If you are using a single GPU, this parameter is not necessary.
It is also required to pass cuda=True when configuring the model, without explicitly specifying providers: cuda and providers are mutually exclusive parameters.
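
Applied to the snippet above, the model setup and embed call would look roughly like this (a sketch under the assumptions just stated; the model name and device list come from the earlier example, and docs stands in for items_pool.docs):

from fastembed import TextEmbedding

# cuda=True replaces the explicit providers=["CUDAExecutionProvider"];
# the two options are mutually exclusive.
embedding_model_gpu = TextEmbedding(
    model_name="intfloat/multilingual-e5-large",
    cuda=True,
    device_ids=[1, 2, 3],
    lazy_load=True,  # let each spawned worker load the model on its own GPU
)

# parallel must equal len(device_ids) (here 3), so one worker is spawned per GPU.
docs = ["some text to embed"] * 1024  # stand-in for items_pool.docs above
embeds = list(embedding_model_gpu.embed(docs, batch_size=256, parallel=3))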
