
LangChain / LlamaIndex Embeddings and Chunkers transformers #2176

Open
HectorBst opened this issue Dec 22, 2024 · 0 comments


HectorBst commented Dec 22, 2024

Feature description

I don't know whether helpers or default transformers are something being considered, or whether the core library is the right place for them, but here is the request.

LangChain and LlamaIndex are frameworks widely used in the construction of RAG solutions. In particular, they each define abstract Embeddings and Splitter/Parser interfaces, implemented for different technologies or approaches by themselves or partners (OpenAI, Hugging Face, Bedrock, etc.).

The idea is to provide Transformers built on these abstract interfaces that take developer-supplied implementations.

This would make it easier to integrate dlt with these frameworks for these needs. Moreover, such Transformers could be agnostic and compatible with all existing Vector Store destinations, so there would be no need to write an embedding implementation for every type of Vector Store.

This could also avoid having to make limiting choices if there is a need to provide a default embedding or splitter Transformer in dlt core (e.g. choosing to implement a default embedding with OpenAI only, or to only use a SemanticChunker by default). We could still optionally provide a few additional transformers for convenience if needed, e.g. one using the LangChain embeddings interface together with the LangChain OpenAI implementation.
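As a rough sketch of the agnostic-wrapper idea in plain Python (the `Embeddings` protocol below mirrors the `embed_documents` signature of LangChain's abstract interface; `langchain_embedding_transformer` and `FakeEmbeddings` are illustrative names, not existing dlt or LangChain API):

```python
from typing import Iterable, Iterator, List, Protocol


class Embeddings(Protocol):
    """Stand-in for LangChain's abstract Embeddings interface."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...


def langchain_embedding_transformer(embeddings: Embeddings):
    """Build a generic transformer step around any supplied implementation."""

    def transform(items: Iterable[dict]) -> Iterator[dict]:
        batch = list(items)
        # One batched call to the developer-supplied embedding model.
        vectors = embeddings.embed_documents([item["text"] for item in batch])
        for item, vector in zip(batch, vectors):
            yield {**item, "vector": vector}

    return transform


class FakeEmbeddings:
    """Trivial local-testing implementation; swap in e.g. an OpenAI-backed one."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t))] for t in texts]


rows = list(langchain_embedding_transformer(FakeEmbeddings())([{"text": "hello"}]))
# rows[0] now carries both the original text and its vector
```

Because the wrapper only depends on the abstract interface, switching from `FakeEmbeddings` to any other implementation is a one-line change and works the same against any Vector Store destination.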

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

  • I want to easily use an embedding model to compute vectors, using an existing LangChain / LlamaIndex implementation.
  • I want to reuse existing LangChain / LlamaIndex code.
  • I want to be able to switch easily from one Embedding model to another (e.g. for local testing or future architecture changes), or to switch from one type of chunking to another.

Proposed solution

Provide Transformers based on the abstract Embeddings and Splitter/Parser interfaces of LangChain and LlamaIndex, taking developer-supplied implementations.

For chunking, transformers would take texts from the resource or a previous transformer, split them into chunks using the supplied implementation, and return the chunks.
For embedding, transformers would take texts from the resource or a previous transformer, call the embedding model using the supplied implementation to get vectors, and return the texts and vectors.

Then the vectors and chunks are handled by the destination (or the next transformer).
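The chunk-then-embed flow could look roughly like this (plain generator functions standing in for dlt transformer steps; `FixedSizeSplitter`, `chunk`, `embed`, and `LengthEmbedder` are hypothetical stand-ins for the supplied splitter/embedding implementations, not real dlt or LangChain API):

```python
from typing import Iterable, Iterator, List


class FixedSizeSplitter:
    """Stand-in exposing a split_text method like LangChain text splitters."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size

    def split_text(self, text: str) -> List[str]:
        n = self.chunk_size
        return [text[i : i + n] for i in range(0, len(text), n)]


def chunk(items: Iterable[dict], splitter) -> Iterator[dict]:
    """Chunking step: documents in, one row per chunk out."""
    for item in items:
        for i, piece in enumerate(splitter.split_text(item["text"])):
            yield {"doc_id": item["doc_id"], "chunk_no": i, "text": piece}


def embed(items: Iterable[dict], embedder) -> Iterator[dict]:
    """Embedding step: chunks in, chunks plus vectors out."""
    for item in items:
        [vector] = embedder.embed_documents([item["text"]])
        yield {**item, "vector": vector}


class LengthEmbedder:
    """Dummy embedder: the vector is just the chunk length."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t))] for t in texts]


docs = [{"doc_id": 1, "text": "abcdef"}]
rows = list(embed(chunk(docs, FixedSizeSplitter(4)), LengthEmbedder()))
# yields two chunks, "abcd" and "ef", each carrying a vector
```

The resulting rows (chunk text plus vector) are exactly what a Vector Store destination, or a further transformer, would consume next.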

Related issues

#576

#1615
