Added doc pages in llm tool (#8163)
Co-authored-by: tryptofanik <[email protected]>
Co-authored-by: Olivier Ruas <[email protected]>
GitOrigin-RevId: a1f7fbb44ca9caa5885fbfcd782a237b26064055
3 people authored and Manul from Pathway committed Feb 10, 2025
1 parent 7ee332d commit 0aa6dbb
Showing 9 changed files with 251 additions and 24 deletions.
2 changes: 1 addition & 1 deletion docs/2.developers/4.user-guide/50.llm-xpack/10.overview.md
@@ -24,7 +24,7 @@ or
`pip install "pathway[all]"`
```

- ## Wrappers for LLMs
+ ## Wrappers for LLMs (LLM Chats)

Out of the box, the LLM xpack provides wrappers for text generation and embedding LLMs. For text generation, you can use native wrappers for OpenAI, HuggingFace models, and Cohere, as well as LiteLLM, which gives you access to many other popular models, including Azure OpenAI, HuggingFace (via their API), and Gemini.
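
For instance, a chat wrapper is a UDF that can be applied directly to a table column. A minimal sketch, assuming `OPENAI_API_KEY` is set in the environment (the table and column names are illustrative):

```python
import os

import pathway as pw
from pathway.xpacks.llm import llms

# An illustrative table of questions; a real pipeline would stream these in.
queries = pw.debug.table_from_markdown('''
questions
How do I chunk documents?
''')

model = llms.OpenAIChat(
    model="gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"],  # read the API key from the environment
)

# prompt_chat_single_qa wraps a plain string into the chat message format.
responses = queries.select(result=model(llms.prompt_chat_single_qa(pw.this.questions)))
pw.debug.compute_and_print(responses)
```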

@@ -175,7 +175,7 @@ wget 'https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/pathway
```


- For each document and each query, embeddings are computed using a pre-trained language model. These embeddings are numerical representations of the documents: they are used to find the documents that are most relevant to each query. Pathway offers API integration with premier LLM service providers, including but not limited to OpenAI and HuggingFace. You can import the model interface for the provider of your choice and specify the API key and the model ID to call. By default, the embedder is `text-embedding-ada-002` from OpenAI, which returns vectors of dimension `1536`. Please check out [openai-model-endpoint-compatibility](https://platform.openai.com/docs/models#model-endpoint-compatibility) for more information on the available models.
+ For each document and each query, embeddings are computed using a pre-trained language model. These embeddings are numerical representations of the documents: they are used to find the documents that are most relevant to each query. Pathway offers API integration with premier LLM service providers, including but not limited to OpenAI and HuggingFace. You can import the model interface for the provider of your choice and specify the API key and the model ID to call. By default, the embedder is `text-embedding-3-small` from OpenAI, which returns vectors of dimension `1536`. Please check out [openai-model-endpoint-compatibility](https://platform.openai.com/docs/models#model-endpoint-compatibility) for more information on the available models.

To implement this, remove the LLM query at the end of the program you obtained in the last section: you first need to retrieve context before querying the LLM. You should be left with the following code:
```python [app.py]
# ... (rest of app.py not shown in this diff view)
```
@@ -47,7 +47,7 @@ text_splitter = TokenCountSplitter(
)
```

- This configuration creates chunks of 100–500 tokens using the `cl100k_base` tokenizer, compatible with OpenAI’s `text-embedding-ada-002` model.
+ This configuration creates chunks of 100–500 tokens using the `cl100k_base` tokenizer, compatible with OpenAI's embedding models.

For more on token encodings, refer to [OpenAI's tiktoken guide](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#encodings).

@@ -71,7 +71,7 @@ from pathway.xpacks.llm.embedders import OpenAIEmbedder
embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])
```

- The default model for `OpenAIEmbedder` is `text-embedding-ada-002`.
+ The default model for `OpenAIEmbedder` is `text-embedding-3-small`. More information can be found on the [`Embedders`](/developers/user-guide/llm-xpack/embedders) page.

## Retriever

33 changes: 33 additions & 0 deletions docs/2.developers/4.user-guide/50.llm-xpack/50.splitters.md
@@ -0,0 +1,33 @@
---
title: 'Chunking text'
description: 'Splitters available through the Pathway xpack'
date: '2025-02-04'
thumbnail: ''
tags: ['splitters', 'chunking']
keywords: ['parsers', 'chunking']
---

# Chunking

Embedding an entire document as a single vector often leads to poor retrieval performance. This happens because the model is forced to compress all of the document's information into a single representation, making it difficult to capture granular details. As a result, important context may be lost, and retrieval effectiveness decreases.

There are several strategies for chunking a document. A simple approach is to slice the text every n characters. However, this can split sentences or phrases awkwardly, resulting in incomplete or distorted chunks. Moreover, tokens vary in length (a token might be a single character, a word, or a punctuation mark), so character-based splitting gives little control over the number of tokens per chunk.

A better method is to chunk the text by tokens, ensuring each chunk makes sense and aligns with sentence or paragraph boundaries. Token-based chunking is typically done at logical breakpoints, such as periods, commas, or newlines.

## TokenCountSplitter
Pathway offers a [`TokenCountSplitter`](/developers/api-docs/pathway-xpacks-llm/splitters#pathway.xpacks.llm.splitters.TokenCountSplitter) for token-based chunking. Here's how to use it:

```python
from pathway.xpacks.llm.splitters import TokenCountSplitter

text_splitter = TokenCountSplitter(
min_tokens=100,
max_tokens=500,
encoding_name="cl100k_base"
)
```

This configuration creates chunks of 100–500 tokens using the `cl100k_base` tokenizer, compatible with OpenAI's embedding models.
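
Applied to a table, the splitter is a UDF returning a list of `(chunk, metadata)` tuples per row. A minimal sketch, reusing the `text_splitter` defined above with an illustrative table:

```python
import pathway as pw

# An illustrative table; a real pipeline would read documents from a connector.
docs = pw.debug.table_from_markdown('''
text
Some long document text that should be split into token-based chunks
''')

# Each row yields a list of (chunk, metadata) tuples; flatten to one row per chunk.
chunks = docs.select(chunks=text_splitter(pw.this.text))
chunks = chunks.flatten(pw.this.chunks)
```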

For more on token encodings, refer to [OpenAI's tiktoken guide](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#encodings).
104 changes: 104 additions & 0 deletions docs/2.developers/4.user-guide/50.llm-xpack/60.embedders.md
@@ -0,0 +1,104 @@
---
title: 'Embedders'
description: 'Embedders available through the Pathway xpack'
date: '2025-02-04'
thumbnail: ''
tags: ['tutorial', 'embedder']
keywords: ['LLM', 'GPT', 'OpenAI', 'Gemini', 'LiteLLM', 'Embedder']
---

# Embedders

When storing a document in a vector store, you compute the embedding vector for the text and store the vector with a reference to the original document. You can then compute the embedding of a query and find the embedded documents closest to the query.
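
As a sketch of this workflow, the snippet below embeds documents and queries with the `OpenAIEmbedder` described in the next section, indexes the documents with Pathway's `KNNIndex`, and retrieves the closest documents per query; the tables and `k` are illustrative:

```python
import os

import pathway as pw
from pathway.stdlib.ml.index import KNNIndex
from pathway.xpacks.llm.embedders import OpenAIEmbedder

embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])  # text-embedding-3-small by default

# Illustrative tables; real pipelines would read these from connectors.
documents = pw.debug.table_from_markdown('''
text
John drinks coffee
Someone drinks tea
''')
queries = pw.debug.table_from_markdown('''
query
What does John drink?
''')

documents = documents + documents.select(embedding=embedder(pw.this.text))
queries = queries + queries.select(embedding=embedder(pw.this.query))

# Index the document embeddings; 1536 is the dimension of text-embedding-3-small.
index = KNNIndex(documents.embedding, documents, n_dimensions=1536)

# For each query, fetch the closest documents (returned as a tuple per query).
results = queries + index.get_nearest_items(queries.embedding, k=2).select(
    context=pw.this.text
)
```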

The following embedding wrappers are available through the Pathway xpack:

- [`OpenAIEmbedder`](#openaiembedder) - Embed text with any of OpenAI's embedding models
- [`LiteLLMEmbedder`](#litellmembedder) - Embed text with any model available through LiteLLM
- [`SentenceTransformerEmbedder`](#sentencetransformerembedder) - Embed text with any model available through Sentence Transformers (a.k.a. SBERT), maintained by Hugging Face
- [`GeminiEmbedder`](#geminiembedder) - Embed text with any of Google's available embedding models

## OpenAIEmbedder
The default model for [`OpenAIEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.OpenAIEmbedder) is `text-embedding-3-small`.

```python
import os
import pathway as pw
from pathway.xpacks.llm.parsers import UnstructuredParser
from pathway.xpacks.llm.embedders import OpenAIEmbedder

files = pw.io.fs.read(
os.environ.get("DATA_DIR"),
mode="streaming",
format="binary",
autocommit_duration_ms=50,
)

# Parse the documents in the specified directory
parser = UnstructuredParser(chunking_mode="paged")
documents = files.select(elements=parser(pw.this.data))
documents = documents.flatten(pw.this.elements) # flatten list into multiple rows
documents = documents.select(text=pw.this.elements[0], metadata=pw.this.elements[1])

# Embed each page of the document
embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])
embeddings = documents.select(embedding=embedder(pw.this.text))
```
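
Embedders also accept the UDF execution options visible in the `OpenAIEmbedder` constructor, such as caching and retries. A minimal sketch; the parameter values are illustrative:

```python
import os

import pathway as pw
from pathway.xpacks.llm.embedders import OpenAIEmbedder

embedder = OpenAIEmbedder(
    api_key=os.environ["OPENAI_API_KEY"],
    model="text-embedding-3-small",
    # Cache results so identical texts are not re-embedded.
    cache_strategy=pw.udfs.DefaultCache(),
    # Retry failed API calls with exponential backoff.
    retry_strategy=pw.udfs.ExponentialBackoffRetryStrategy(max_retries=4),
)
```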

## LiteLLMEmbedder
The model for [`LiteLLMEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.LiteLLMEmbedder) has to be specified during initialization. No default is provided.

```python
import pathway as pw
from pathway.xpacks.llm import embedders

# API_KEY is a placeholder for your provider's API key.
embedder = embedders.LiteLLMEmbedder(
    model="text-embedding-3-small", api_key=API_KEY
)
# Create a table with one column for the text to embed
t = pw.debug.table_from_markdown(
"""
text_column
Here is some text
"""
)
res = t.select(ret=embedder(pw.this.text_column))
```

## SentenceTransformerEmbedder
The [`SentenceTransformerEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.SentenceTransformerEmbedder) wrapper lets you use Sentence Transformers (SBERT) models hosted on Hugging Face.

The model is specified during initialization. Here is a list of [`available models`](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html).

```python
import pathway as pw
from pathway.xpacks.llm import embedders

embedder = embedders.SentenceTransformerEmbedder(model="intfloat/e5-large-v2")

# Create a table with text to embed
t = pw.debug.table_from_markdown('''
txt
Some text to embed
''')

# Extract the embedded text
t.select(ret=embedder(pw.this.txt))
```

## GeminiEmbedder
[`GeminiEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.GeminiEmbedder) is the embedder for Google's Gemini embedding services. Available models can be found [`here`](https://ai.google.dev/gemini-api/docs/models/gemini#text-embedding-and-embedding).

```python
import pathway as pw
from pathway.xpacks.llm import embedders

embedder = embedders.GeminiEmbedder(model="models/text-embedding-004")

# Create a table with a column for the text to embed
t = pw.debug.table_from_markdown('''
txt
Some text to embed
''')

t.select(ret=embedder(pw.this.txt))
```
@@ -1,6 +1,6 @@
---
title: 'LLM Chats'
- description: 'LLM Wrappers and embedders available through Pathway xpack'
+ description: 'LLM Wrappers available through Pathway xpack'
date: '2025-01-30'
thumbnail: ''
tags: ['tutorial', 'LLM', 'LLM Wrappers', 'LLM Chats']
@@ -181,23 +181,10 @@ model = llms.OpenAIChat(
# if PATHWAY_PERSISTENT_STORAGE is set, then it is used to cache the calls
cache_strategy=pw.udfs.DefaultCache(),
# select the model
model="gpt-3.5-turbo",
model="gpt-4o-mini",
# read OpenAI API key from environmental variables
api_key=os.environ["OPENAI_API_KEY"],
)
responses = queries.select(result=model(prompt_chat_single_qa(pw.this.questions)))
pw.debug.compute_and_print(responses)
```


- # Embedders
- 
- Pathway also comes with wrappers for embedding models:
- - [`OpenAIEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders#pathway.xpacks.llm.embedders.OpenAIEmbedder)
- - [`LiteLLMEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders#pathway.xpacks.llm.embedders.LiteLLMEmbedder)
- - [`SentenceTransformersEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders#pathway.xpacks.llm.embedders.SentenceTransformerEmbedder)
- - [`GeminiEmbedder`](https://ai.google.dev/gemini-api/docs/models/gemini#text-embedding-and-embedding)
- 
- For more information on the embedders, refer to the Embedding section on the [`Document Indexing`](/developers/user-guide/llm-xpack/docs-indexing) page


103 changes: 103 additions & 0 deletions docs/2.developers/4.user-guide/50.llm-xpack/80.rerankers.md
@@ -0,0 +1,103 @@
---
title: 'Rerankers'
description: 'Rerankers available through the Pathway xpack'
date: '2025-02-04'
thumbnail: ''
tags: ['tutorial', 'reranker']
keywords: ['LLM', 'Reranker']
---

# Rerankers
Rerankers let a model score the relevance of documents against a query.

In RAG systems, initial retrieval is often based on cosine similarity between embeddings. This will likely surface some documents that are not relevant to the query: retrieval relies on vector representations that condense a passage's meaning into a single embedding, which can overlook important nuances. While this approach is fast, it can be inaccurate. To improve the quality of the retrieved documents, it is common to refine the initial set and keep only the most relevant ones using rerankers.
Rerankers help the model reassess and prioritize documents based on their relevance to the query. This is usually done by presenting the model with `(query, document)` pairs and scoring how much the given `document` contributes to answering the `query`.

Pathway xpack provides the following rerankers:
- [`LLMReranker`](#llmreranker) - Have an LLM rerank the documents
- [`CrossEncoderReranker`](#crossencoderreranker) - Rerank with a CrossEncoder from SBERT (SentenceTransformers)
- [`EncoderReranker`](#encoderreranker) - Rerank using SentenceTransformers encoders (embedding similarity)

## LLMReranker
The [`LLMReranker`](/developers/api-docs/pathway-xpacks-llm/rerankers#pathway.xpacks.llm.rerankers.LLMReranker) asks the provided LLM to score the relevance of each document to the query on a scale from 1 to 5.

```python
import pandas as pd

import pathway as pw
from pathway.xpacks.llm import llms, rerankers

docs = [
{"text": "John drinks coffee"},
{"text": "Someone drinks tea"},
{"text": "Nobody drinks coca-cola"},
]

query = "What does John drink?"

df = pd.DataFrame({"docs": docs, "prompt": query})

chat = llms.OpenAIChat(model="gpt-4o-mini", api_key=API_KEY)  # API_KEY is a placeholder for your OpenAI key
reranker = rerankers.LLMReranker(llm=chat)

input = pw.debug.table_from_pandas(df)
res = input.select(rank=reranker(pw.this.docs["text"], pw.this.prompt))
pw.debug.compute_and_print(res)
```

## CrossEncoderReranker
The [`CrossEncoderReranker`](/developers/api-docs/pathway-xpacks-llm/rerankers#pathway.xpacks.llm.rerankers.CrossEncoderReranker) works on text pairs and computes a score between 0 and 1 (or raw logits if no activation function is passed). The score measures how relevant the document is to the query.
More information can be found [`here`](https://www.sbert.net/docs/cross_encoder/pretrained_models.html).


```python
import pandas as pd
import torch

import pathway as pw
from pathway.xpacks.llm import rerankers

docs = [
{"text": "John drinks coffee"},
{"text": "Someone drinks tea"},
{"text": "Nobody drinks coca-cola"},
]

query = "What does John drink?"

df = pd.DataFrame({"docs": docs, "prompt": query})

reranker = rerankers.CrossEncoderReranker(
model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
default_activation_function=torch.nn.Sigmoid(), # Make outputs between 0..1
)

input = pw.debug.table_from_pandas(df)
res = input.select(
rank=reranker(pw.this.docs["text"], pw.this.prompt), text=pw.this.docs["text"]
)
pw.debug.compute_and_print(res)
```

## EncoderReranker
The [`EncoderReranker`](/developers/api-docs/pathway-xpacks-llm/rerankers#pathway.xpacks.llm.rerankers.EncoderReranker) computes the relevance of the query to the supplied documents using the [`SentenceTransformer encoders`](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html).

```python
import pandas as pd

import pathway as pw
from pathway.xpacks.llm import rerankers

docs = [
{"text": "John drinks coffee"},
{"text": "Someone drinks tea"},
{"text": "Nobody drinks coca-cola"},
]

query = "What does John drink?"

df = pd.DataFrame({"docs": docs, "prompt": query})

reranker = rerankers.EncoderReranker(
model_name="all-mpnet-base-v2",
)

input = pw.debug.table_from_pandas(df)
res = input.select(
rank=reranker(pw.this.docs["text"], pw.this.prompt), text=pw.this.docs["text"]
)
```
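
Whichever reranker you choose, the scores can then be used to narrow the context passed to an LLM. A minimal sketch, reusing `res` from the example above; the threshold is illustrative and depends on the chosen reranker:

```python
# Keep only the documents whose reranker score clears the threshold.
filtered = res.filter(pw.this.rank >= 0.5)
pw.debug.compute_and_print(filtered)
```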
10 changes: 5 additions & 5 deletions python/pathway/xpacks/llm/embedders.py
@@ -119,7 +119,7 @@ class OpenAIEmbedder(BaseEmbedder):
>>> import pathway as pw
>>> from pathway.xpacks.llm import embedders
>>> embedder = embedders.OpenAIEmbedder(model="text-embedding-ada-002")
>>> embedder = embedders.OpenAIEmbedder(model="text-embedding-3-small")
>>> t = pw.debug.table_from_markdown('''
... txt
... Text
@@ -132,7 +132,7 @@ class OpenAIEmbedder(BaseEmbedder):
>>> embedder = embedders.OpenAIEmbedder()
>>> t = pw.debug.table_from_markdown('''
... txt | model
- ... Text | text-embedding-ada-002
+ ... Text | text-embedding-3-small
... ''')
>>> t.select(ret=embedder(pw.this.txt, model=pw.this.model))
<pathway.Table schema={'ret': numpy.ndarray[typing.Any, numpy.dtype[typing.Any]]}>
@@ -144,7 +144,7 @@ def __init__(
capacity: int | None = None,
retry_strategy: udfs.AsyncRetryStrategy | None = None,
cache_strategy: udfs.CacheStrategy | None = None,
- model: str | None = "text-embedding-ada-002",
+ model: str | None = "text-embedding-3-small",
**openai_kwargs,
):
with optional_imports("xpack-llm"):
@@ -211,7 +211,7 @@ class LiteLLMEmbedder(BaseEmbedder):
>>> import pathway as pw
>>> from pathway.xpacks.llm import embedders
>>> embedder = embedders.LiteLLMEmbedder(model="text-embedding-ada-002")
>>> embedder = embedders.LiteLLMEmbedder(model="text-embedding-3-small")
>>> t = pw.debug.table_from_markdown('''
... txt
... Text
@@ -224,7 +224,7 @@ class LiteLLMEmbedder(BaseEmbedder):
>>> embedder = embedders.LiteLLMEmbedder()
>>> t = pw.debug.table_from_markdown('''
... txt | model
- ... Text | text-embedding-ada-002
+ ... Text | text-embedding-3-small
... ''')
>>> t.select(ret=embedder(pw.this.txt, model=pw.this.model))
<pathway.Table schema={'ret': numpy.ndarray[typing.Any, numpy.dtype[typing.Any]]}>
