Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Co-authored-by: tryptofanik <[email protected]> Co-authored-by: Olivier Ruas <[email protected]> GitOrigin-RevId: a1f7fbb44ca9caa5885fbfcd782a237b26064055
1 parent
7ee332d
commit 0aa6dbb
Showing
9 changed files
with
251 additions
and
24 deletions.
33 changes: 33 additions & 0 deletions
33
docs/2.developers/4.user-guide/50.llm-xpack/50.splitters.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
--- | ||
title: 'Chunking text' | ||
description: 'Splitters available through the Pathway xpack' | ||
date: '2025-02-04' | ||
thumbnail: '' | ||
tags: ['splitters', 'chunking'] | ||
keywords: ['parsers', 'chunking'] | ||
--- | ||
|
||
# Chunking | ||
|
||
Embedding entire documents as a single vector often leads to poor retrieval performance. This happens because the model is forced to compress all the document's information into a single representation, making it difficult to capture granular details. As a result, important context may be lost, and retrieval effectiveness decreases. | ||
|
||
There are several strategies for chunking a document. A simple approach is to slice the text every n characters. However, this can split sentences or phrases awkwardly, resulting in incomplete or distorted chunks. Additionally, the number of characters per token varies (a token might be a character, a word, or a punctuation mark), so character-based splitting makes it hard to keep chunk sizes consistent when measured in tokens. | ||
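
To make the problem with fixed-width slicing concrete, here is a minimal sketch in plain Python. The helper `chunk_by_chars` and the sample text are illustrative assumptions and not part of Pathway.

```python
def chunk_by_chars(text: str, n: int) -> list[str]:
    """Slice text into consecutive pieces of at most n characters."""
    return [text[i:i + n] for i in range(0, len(text), n)]


# Hypothetical sample text, used only for illustration.
text = "Pathway processes streams. Chunks should respect sentences."
chunks = chunk_by_chars(text, 20)

for chunk in chunks:
    # The first chunk ends in the middle of the word "streams".
    print(repr(chunk))
```

Nothing is lost when the pieces are joined back together, but each piece on its own may start or end mid-word, which is what degrades the resulting embeddings.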
|
||
A better method is to chunk the text by tokens, ensuring each chunk makes sense and aligns with sentence or paragraph boundaries. Token-based chunking is typically done at logical breakpoints, such as periods, commas, or newlines. | ||
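
The idea of packing text into token-bounded chunks at logical breakpoints can be sketched as follows. This is a simplified illustration, not the implementation of any Pathway splitter: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words, whereas a real setup would use the tokenizer of the embedding model, and only sentence-ending punctuation is treated as a breakpoint.

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def chunk_by_sentences(text: str, max_tokens: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would overflow it.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Embeddings compress meaning. Long documents lose detail. "
    "Chunking helps retrieval. Keep sentences whole."
)
chunks = chunk_by_sentences(text, max_tokens=7)
for chunk in chunks:
    print(chunk)
```

Every chunk ends on a sentence boundary, so each one remains readable on its own while staying within the token budget.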
|
||
## TokenCountSplitter | ||
Pathway offers a [`TokenCountSplitter`](/developers/api-docs/pathway-xpacks-llm/splitters#pathway.xpacks.llm.splitters.TokenCountSplitter) for token-based chunking. Here's how to use it: | ||
|
||
```python | ||
from pathway.xpacks.llm.splitters import TokenCountSplitter | ||
|
||
text_splitter = TokenCountSplitter( | ||
min_tokens=100, | ||
max_tokens=500, | ||
encoding_name="cl100k_base" | ||
) | ||
``` | ||
|
||
This configuration creates chunks of 100–500 tokens using the `cl100k_base` tokenizer, compatible with OpenAI's embedding models. | ||
|
||
For more on token encodings, refer to [OpenAI's tiktoken guide](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#encodings). |
104 changes: 104 additions & 0 deletions
104
docs/2.developers/4.user-guide/50.llm-xpack/60.embedders.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,104 @@ | ||
--- | ||
title: 'Embedders' | ||
description: 'Embedders available through the Pathway xpack' | ||
date: '2025-02-04' | ||
thumbnail: '' | ||
tags: ['tutorial', 'embedder'] | ||
keywords: ['LLM', 'GPT', 'OpenAI', 'Gemini', 'LiteLLM', 'Embedder'] | ||
--- | ||
|
||
# Embedders | ||
|
||
When storing a document in a vector store, you compute the embedding vector for the text and store the vector with a reference to the original document. You can then compute the embedding of a query and find the embedded documents closest to the query. | ||
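
The retrieval step described above can be sketched with plain Python. The three-dimensional vectors and document names below are made up for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and a real vector store would use an index rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "vector store": embedding stored with a reference to the document.
store = {
    "doc_coffee": [0.9, 0.1, 0.0],
    "doc_tea": [0.1, 0.9, 0.0],
    "doc_cola": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # made-up embedding of the query

# Rank document references from most to least similar to the query.
ranked = sorted(
    store,
    key=lambda doc_id: cosine_similarity(store[doc_id], query_embedding),
    reverse=True,
)
print(ranked)
```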
|
||
The following embedding wrappers are available through the Pathway xpack: | ||
|
||
- [`OpenAIEmbedder`](#openaiembedder) - Embed text with any of OpenAI's embedding models | ||
- [`LiteLLMEmbedder`](#litellmembedder) - Embed text with any model available through LiteLLM | ||
- [`SentenceTransformerEmbedder`](#sentencetransformerembedder) - Embed text with any model available through SentenceTransformers (a.k.a. SBERT), maintained by Hugging Face | ||
- [`GeminiEmbedder`](#geminiembedder) - Embed text with any of Google's available embedding models | ||
|
||
## OpenAIEmbedder | ||
The default model for [`OpenAIEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.OpenAIEmbedder) is `text-embedding-3-small`. | ||
|
||
```python | ||
import os | ||
import pathway as pw | ||
from pathway.xpacks.llm.parsers import UnstructuredParser | ||
from pathway.xpacks.llm.embedders import OpenAIEmbedder | ||
|
||
files = pw.io.fs.read( | ||
os.environ.get("DATA_DIR"), | ||
mode="streaming", | ||
format="binary", | ||
autocommit_duration_ms=50, | ||
) | ||
|
||
# Parse the documents in the specified directory | ||
parser = UnstructuredParser(chunking_mode="paged") | ||
documents = files.select(elements=parser(pw.this.data)) | ||
documents = documents.flatten(pw.this.elements) # flatten list into multiple rows | ||
documents = documents.select(text=pw.this.elements[0], metadata=pw.this.elements[1]) | ||
|
||
# Embed each page of the document | ||
embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"]) | ||
embeddings = documents.select(embedding=embedder(pw.this.text)) | ||
``` | ||
|
||
## LiteLLMEmbedder | ||
The model for [`LiteLLMEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.LiteLLMEmbedder) has to be specified during initialization. No default is provided. | ||
|
||
```python | ||
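import os | ||
import pathway as pw | ||
from pathway.xpacks.llm import embedders | ||
|
||
API_KEY = os.environ["OPENAI_API_KEY"]  # assumes the key is set in the environment | ||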
|
||
embedder = embedders.LiteLLMEmbedder( | ||
model="text-embedding-3-small", api_key=API_KEY | ||
) | ||
# Create a table with one column for the text to embed | ||
t = pw.debug.table_from_markdown( | ||
""" | ||
text_column | ||
Here is some text | ||
""" | ||
) | ||
res = t.select(ret=embedder(pw.this.text_column)) | ||
``` | ||
|
||
## SentenceTransformerEmbedder | ||
The [`SentenceTransformerEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.SentenceTransformerEmbedder) lets you use models from the Hugging Face Sentence Transformers library. | ||
|
||
The model is specified during initialization. Here is a list of [`available models`](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html). | ||
|
||
```python | ||
import pathway as pw | ||
from pathway.xpacks.llm import embedders | ||
|
||
embedder = embedders.SentenceTransformerEmbedder(model="intfloat/e5-large-v2") | ||
|
||
# Create a table with text to embed | ||
t = pw.debug.table_from_markdown(''' | ||
txt | ||
Some text to embed | ||
''') | ||
|
||
# Compute the embedding of each row | ||
res = t.select(ret=embedder(pw.this.txt)) | ||
``` | ||
|
||
## GeminiEmbedder | ||
[`GeminiEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.GeminiEmbedder) is the embedder for Google's Gemini embedding services. Available models can be found [`here`](https://ai.google.dev/gemini-api/docs/models/gemini#text-embedding-and-embedding). | ||
|
||
```python | ||
import pathway as pw | ||
from pathway.xpacks.llm import embedders | ||
|
||
embedder = embedders.GeminiEmbedder(model="models/text-embedding-004") | ||
|
||
# Create a table with a column for the text to embed | ||
t = pw.debug.table_from_markdown(''' | ||
txt | ||
Some text to embed | ||
''') | ||
|
||
t.select(ret=embedder(pw.this.txt)) | ||
``` |
103 changes: 103 additions & 0 deletions
103
docs/2.developers/4.user-guide/50.llm-xpack/80.rerankers.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,103 @@ | ||
--- | ||
title: 'Rerankers' | ||
description: 'Rerankers available through the Pathway xpack' | ||
date: '2025-02-04' | ||
thumbnail: '' | ||
tags: ['tutorial', 'reranker'] | ||
keywords: ['LLM', 'Reranker'] | ||
--- | ||
|
||
# Rerankers | ||
Rerankers score how relevant each document is to a query or text, so that the documents can be ranked by relevance. | ||
|
||
||
In RAG systems, initial retrieval is often based on cosine similarity between embeddings. As a result, some retrieved documents are likely to be irrelevant to the query. This happens because retrieval relies on vector representations that condense a passage's meaning into a single embedding, which can overlook important nuances. While this approach is fast, it can also be inaccurate. To improve the quality of retrieved documents, it is common to refine the initial set and select only the most relevant ones using rerankers. | ||
Rerankers help the model reassess and prioritize documents based on their relevance to the query. This is usually done by presenting the model with `(query, document)` pairs and evaluating whether the given `document` contributes to answering the `query`. | ||
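
The retrieve-then-rerank pattern can be sketched as follows. The scoring function here is a deliberately crude word-overlap heuristic, chosen only so the example runs without any model; it is an assumption standing in for a real reranker such as an LLM or a cross-encoder. Both `overlap_score` and `rerank` are hypothetical helpers, not Pathway APIs.

```python
def overlap_score(query: str, document: str) -> float:
    """Toy relevance score: fraction of query words found in the document."""
    query_words = set(query.lower().strip("?").split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)


def rerank(query: str, documents: list[str], score_fn, top_k: int) -> list[str]:
    """Score every (query, document) pair and keep the top_k best documents."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


# Candidates as they might come back from a first, coarse retrieval step.
candidates = ["Someone drinks tea", "John drinks coffee", "Nobody drinks coca-cola"]
top = rerank("What does John drink?", candidates, overlap_score, top_k=1)
print(top)
```

Swapping `overlap_score` for a model-based scorer changes the quality of the ranking but not the overall shape of the pipeline.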
|
||
Pathway xpack provides the following rerankers: | ||
- [`LLMReranker`](#llmreranker) - Have an LLM rerank the documents | ||
- [`CrossEncoderReranker`](#crossencoderreranker) - Rerank with a CrossEncoder from SBERT / SentenceTransformers | ||
- [`EncoderReranker`](#encoderreranker) - Rerank with a SentenceTransformers encoder by measuring the similarity between query and document embeddings | ||
|
||
## LLMReranker | ||
The [`LLMReranker`](/developers/api-docs/pathway-xpacks-llm/rerankers#pathway.xpacks.llm.rerankers.LLMReranker) asks the provided LLM to evaluate the relevance of a query against the provided documents on a scale of 1 to 5. | ||
|
||
```python | ||
import os | ||
import pandas as pd | ||
import pathway as pw | ||
from pathway.xpacks.llm import llms, rerankers | ||
|
||
API_KEY = os.environ["OPENAI_API_KEY"]  # assumes the key is set in the environment | ||
|
||
docs = [ | ||
{"text": "John drinks coffee"}, | ||
{"text": "Someone drinks tea"}, | ||
{"text": "Nobody drinks coca-cola"}, | ||
] | ||
|
||
query = "What does John drink?" | ||
|
||
df = pd.DataFrame({"docs": docs, "prompt": query}) | ||
|
||
chat = llms.OpenAIChat(model="gpt-4o-mini", api_key=API_KEY) | ||
reranker = rerankers.LLMReranker(llm=chat) | ||
|
||
input = pw.debug.table_from_pandas(df) | ||
res = input.select(rank=reranker(pw.this.docs["text"], pw.this.prompt)) | ||
``` | ||
|
||
## CrossEncoderReranker | ||
The [`CrossEncoderReranker`](/developers/api-docs/pathway-xpacks-llm/rerankers#pathway.xpacks.llm.rerankers.CrossEncoderReranker) works on text pairs and computes a score between 0 and 1 (or the raw logits if no activation function is passed). The score indicates how relevant the document is to the query. | ||
More information can be found [`here`](https://www.sbert.net/docs/cross_encoder/pretrained_models.html). | ||
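
The difference between raw logits and scores in the 0..1 range comes down to the activation function. A minimal sketch of what a sigmoid does to a few made-up logits (the values are assumptions, not output of any model):

```python
import math


def sigmoid(logit: float) -> float:
    """Map any real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


logits = [-4.0, 0.0, 4.0]  # hypothetical raw cross-encoder outputs
scores = [sigmoid(x) for x in logits]
print(scores)
```

The sigmoid is monotonic, so it changes the scale of the scores but not the order in which documents are ranked.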
|
||
|
||
```python | ||
import pandas as pd | ||
import pathway as pw | ||
import torch | ||
from pathway.xpacks.llm import rerankers | ||
|
||
docs = [ | ||
{"text": "John drinks coffee"}, | ||
{"text": "Someone drinks tea"}, | ||
{"text": "Nobody drinks coca-cola"}, | ||
] | ||
|
||
query = "What does John drink?" | ||
|
||
df = pd.DataFrame({"docs": docs, "prompt": query}) | ||
|
||
reranker = rerankers.CrossEncoderReranker( | ||
model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", | ||
default_activation_function=torch.nn.Sigmoid(), # Make outputs between 0..1 | ||
) | ||
|
||
input = pw.debug.table_from_pandas(df) | ||
res = input.select( | ||
rank=reranker(pw.this.docs["text"], pw.this.prompt), text=pw.this.docs["text"] | ||
) | ||
pw.debug.compute_and_print(res) | ||
``` | ||
|
||
## EncoderReranker | ||
The [`EncoderReranker`](/developers/api-docs/pathway-xpacks-llm/rerankers#pathway.xpacks.llm.rerankers.EncoderReranker) computes the relevance of the query to the supplied documents using the [`SentenceTransformer encoders`](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html). | ||
|
||
```python | ||
import pandas as pd | ||
import pathway as pw | ||
from pathway.xpacks.llm import rerankers | ||
|
||
docs = [ | ||
{"text": "John drinks coffee"}, | ||
{"text": "Someone drinks tea"}, | ||
{"text": "Nobody drinks coca-cola"}, | ||
] | ||
|
||
query = "What does John drink?" | ||
|
||
df = pd.DataFrame({"docs": docs, "prompt": query}) | ||
|
||
reranker = rerankers.EncoderReranker( | ||
model_name="all-mpnet-base-v2", | ||
) | ||
|
||
input = pw.debug.table_from_pandas(df) | ||
res = input.select( | ||
rank=reranker(pw.this.docs["text"], pw.this.prompt), text=pw.this.docs["text"] | ||
) | ||
``` |
File renamed without changes.