
LangChain / LlamaIndex Embeddings and Chunkers transformers #2176

Open
HectorBst opened this issue Dec 22, 2024 · 0 comments


HectorBst commented Dec 22, 2024

Feature description

I don't know whether helpers or default transformers are something being considered, or whether the core library is the right place for them, but here is the request.

LangChain and LlamaIndex are frameworks widely used in the construction of RAG solutions. In particular, they each define abstract Embeddings and Splitter/Parser interfaces, implemented for different technologies or approaches by themselves or partners (OpenAI, Hugging Face, Bedrock, etc.).

The idea is to provide Transformers built on these abstract interfaces that take developer-supplied implementations.

This would make it easier to integrate dlt with these frameworks for these needs. Moreover, such Transformers could be agnostic and compatible with all existing Vector Store destinations, so there would be no need to write an embedding implementation for every type of Vector Store.

This could also avoid having to make limiting choices if there is a need to provide a default embedding or splitter Transformer in dlt core (e.g. choosing to implement a default embedding with OpenAI only, or to only use a SemanticChunker by default). We could still optionally provide a few additional transformers for convenience if needed, e.g. one using the LangChain embeddings interface together with the LangChain OpenAI implementation.
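As a rough sketch of the agnostic-wrapper idea in plain Python (the `Embeddings` protocol below mirrors the `embed_documents` signature of LangChain's abstract interface; `langchain_embedding_transformer` and `FakeEmbeddings` are illustrative names, not existing dlt or LangChain API):

```python
from typing import Iterable, Iterator, List, Protocol


class Embeddings(Protocol):
    """Stand-in for LangChain's abstract Embeddings interface."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...


def langchain_embedding_transformer(embeddings: Embeddings):
    """Build a generic transformer step around any supplied implementation."""

    def transform(items: Iterable[dict]) -> Iterator[dict]:
        batch = list(items)
        # One batched call to the developer-supplied embedding model.
        vectors = embeddings.embed_documents([item["text"] for item in batch])
        for item, vector in zip(batch, vectors):
            yield {**item, "vector": vector}

    return transform


class FakeEmbeddings:
    """Trivial local-testing implementation; swap in e.g. an OpenAI-backed one."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t))] for t in texts]


rows = list(langchain_embedding_transformer(FakeEmbeddings())([{"text": "hello"}]))
# rows[0] now carries both the original text and its vector
```

Because the wrapper only depends on the abstract interface, switching from `FakeEmbeddings` to any other implementation is a one-line change and works the same against any Vector Store destination.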

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

  • I want to easily use an embedding model to compute vectors, using an existing LangChain / LlamaIndex implementation.
  • I want to reuse existing LangChain / LlamaIndex code.
  • I want to be able to switch easily from one Embedding model to another (e.g. for local testing or future architecture changes), or to switch from one type of chunking to another.

Proposed solution

Provide Transformers based on the abstract Embeddings and Splitter/Parser interfaces of LangChain and LlamaIndex, taking developer-supplied implementations.

For chunking, transformers would take texts from the resource or a previous transformer, split them into chunks using the supplied implementation, and return the chunks.
For embedding, transformers would take texts from the resource or a previous transformer, call the embedding model using the supplied implementation to get vectors, and return the texts and vectors.

Then the vectors and chunks are handled by the destination (or the next transformer).
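The chunk-then-embed flow could look roughly like this (plain generator functions standing in for dlt transformer steps; `FixedSizeSplitter`, `chunk`, `embed`, and `LengthEmbedder` are hypothetical stand-ins for the supplied splitter/embedding implementations, not real dlt or LangChain API):

```python
from typing import Iterable, Iterator, List


class FixedSizeSplitter:
    """Stand-in exposing a split_text method like LangChain text splitters."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size

    def split_text(self, text: str) -> List[str]:
        n = self.chunk_size
        return [text[i : i + n] for i in range(0, len(text), n)]


def chunk(items: Iterable[dict], splitter) -> Iterator[dict]:
    """Chunking step: documents in, one row per chunk out."""
    for item in items:
        for i, piece in enumerate(splitter.split_text(item["text"])):
            yield {"doc_id": item["doc_id"], "chunk_no": i, "text": piece}


def embed(items: Iterable[dict], embedder) -> Iterator[dict]:
    """Embedding step: chunks in, chunks plus vectors out."""
    for item in items:
        [vector] = embedder.embed_documents([item["text"]])
        yield {**item, "vector": vector}


class LengthEmbedder:
    """Dummy embedder: the vector is just the chunk length."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t))] for t in texts]


docs = [{"doc_id": 1, "text": "abcdef"}]
rows = list(embed(chunk(docs, FixedSizeSplitter(4)), LengthEmbedder()))
# yields two chunks, "abcd" and "ef", each carrying a vector
```

The resulting rows (chunk text plus vector) are exactly what a Vector Store destination, or a further transformer, would consume next.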

Related issues

#576

#1615
