Added doc pages in llm tool (#8163)

Co-authored-by: tryptofanik <[email protected]>
Co-authored-by: Olivier Ruas <[email protected]>
GitOrigin-RevId: a1f7fbb44ca9caa5885fbfcd782a237b26064055
pathwaycom · Feb 10, 2025 · 0aa6dbb · 0aa6dbb
1 parent 7ee332d
commit 0aa6dbb

Showing 9 changed files with 251 additions and 24 deletions.
diff --git a/docs/2.developers/4.user-guide/50.llm-xpack/10.overview.md b/docs/2.developers/4.user-guide/50.llm-xpack/10.overview.md
@@ -24,7 +24,7 @@ or
 `pip install "pathway[all]"`
 ```
 
-## Wrappers for LLMs
+## Wrappers for LLMs (LLM Chats)
 
 Out of the box, the LLM xpack provides wrappers for text generation and embedding LLMs. For text generation, you can use native wrappers for the OpenAI, HuggingFace models, Cohere and LiteLLM (which enables you to use many other popular models, including Azure OpenAI, HuggingFace (when using their API) or Gemini.
 

diff --git a/...-guide/50.llm-xpack/30.llm-app-pathway.md → ...-guide/50.llm-xpack/20.llm-app-pathway.md b/...-guide/50.llm-xpack/30.llm-app-pathway.md → ...-guide/50.llm-xpack/20.llm-app-pathway.md
@@ -175,7 +175,7 @@ wget 'https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/pathway
 ```
 
 
-For each document and each query, embeddings are computed using a pre-trained language model. These embeddings are numerical representations of the documents: they are used to find the documents that are most relevant to each query. Pathway offers API integration with premier LLM service providers, including but not limited to OpenAI and HuggingFace. You can import the model interface for the provider of your choice and specify the API key and the model ID to call. By default, the embedder is `text-embedding-ada-002` from OpenAI, which returns vectors of dimension `1536`. Please check out [openai-model-endpoint-compatibility](https://platform.openai.com/docs/models#model-endpoint-compatibility) for more information on the available models.
+For each document and each query, embeddings are computed using a pre-trained language model. These embeddings are numerical representations of the documents: they are used to find the documents that are most relevant to each query. Pathway offers API integration with premier LLM service providers, including but not limited to OpenAI and HuggingFace. You can import the model interface for the provider of your choice and specify the API key and the model ID to call. By default, the embedder is `text-embedding-3-small` from OpenAI, which returns vectors of dimension `1536`. Please check out [openai-model-endpoint-compatibility](https://platform.openai.com/docs/models#model-endpoint-compatibility) for more information on the available models.
 
 To implement this, remove the LLM query at the end of the program you obtained in the last section: you first need to retrieve context before querying the LLM. You should be left with the following code:
 ```python [app.py]

diff --git a/...er-guide/50.llm-xpack/40.docs-indexing.md → ...er-guide/50.llm-xpack/30.docs-indexing.md b/...er-guide/50.llm-xpack/40.docs-indexing.md → ...er-guide/50.llm-xpack/30.docs-indexing.md
@@ -47,7 +47,7 @@ text_splitter = TokenCountSplitter(
 )
 ```
 
-This configuration creates chunks of 100–500 tokens using the `cl100k_base` tokenizer, compatible with OpenAI’s `text-embedding-ada-002` model.
+This configuration creates chunks of 100–500 tokens using the `cl100k_base` tokenizer, compatible with OpenAI's embedding models.
 
 For more on token encodings, refer to [OpenAI's tiktoken guide](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#encodings).
 
@@ -71,7 +71,7 @@ from pathway.xpacks.llm.embedders import OpenAIEmbedder
 embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])
 ```
 
-The default model for `OpenAIEmbedder` is `text-embedding-ada-002`.
+The default model for `OpenAIEmbedder` is `text-embedding-3-small`. More information can be found on the [`Embedders page`](/developers/user-guide/llm-xpack/embedders).
 
 ## Retriever
 

diff --git a/docs/2.developers/4.user-guide/50.llm-xpack/50.splitters.md b/docs/2.developers/4.user-guide/50.llm-xpack/50.splitters.md
@@ -0,0 +1,33 @@
+---
+title: 'Chunking text'
+description: 'Splitters available through the Pathway xpack'
+date: '2025-02-04'
+thumbnail: ''
+tags: ['splitters', 'chunking']
+keywords: ['parsers', 'chunking']
+---
+
+# Chunking
+
+Embedding entire documents as a single vector often leads to poor retrieval performance. This happens because the model is forced to compress all the document's information into a single representation, making it difficult to capture granular details. As a result, important context may be lost, and retrieval effectiveness decreases.
+
+There are several strategies for chunking a document. A simple approach is to slice the text every n characters. However, this can split sentences or phrases awkwardly, resulting in incomplete or distorted chunks. Additionally, token lengths vary (a token might be a character, a word, or a punctuation mark), so character-based splitting makes it hard to keep chunk sizes consistent in tokens.
+
+A better method is to chunk the text by tokens, ensuring each chunk makes sense and aligns with sentence or paragraph boundaries. Token-based chunking is typically done at logical breakpoints, such as periods, commas, or newlines.
+
+## TokenCountSplitter
+Pathway offers a [`TokenCountSplitter`](/developers/api-docs/pathway-xpacks-llm/splitters#pathway.xpacks.llm.splitters.TokenCountSplitter) for token-based chunking. Here's how to use it:
+
+```python
+from pathway.xpacks.llm.splitters import TokenCountSplitter
+
+text_splitter = TokenCountSplitter(
+    min_tokens=100,
+    max_tokens=500,
+    encoding_name="cl100k_base"
+)
+```
+
+This configuration creates chunks of 100–500 tokens using the `cl100k_base` tokenizer, compatible with OpenAI's embedding models.
+
+For more on token encodings, refer to [OpenAI's tiktoken guide](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#encodings).
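Note on `50.splitters.md`: the page describes chunking by tokens at logical breakpoints such as sentence ends. The idea can be sketched in plain Python as below. This is only an illustration under loud assumptions: whitespace-separated words stand in for real `cl100k_base` tokens, the function name `chunk_by_tokens` is hypothetical, and nothing here reflects how `TokenCountSplitter` is actually implemented.

```python
import re


def chunk_by_tokens(text, min_tokens=100, max_tokens=500):
    """Greedy sentence-aligned chunker (hypothetical helper).

    Assumption: whitespace-separated words approximate tokens.
    """
    # Split after sentence-ending punctuation, keeping the punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        # Close the current chunk if this sentence would overflow it and
        # the chunk already holds at least min_tokens words.
        if current and count + len(words) > max_tokens and count >= min_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
        # Anything still longer than max_tokens is cut hard at the limit.
        while count > max_tokens:
            chunks.append(" ".join(current[:max_tokens]))
            current = current[max_tokens:]
            count = len(current)
    if current:
        # The final chunk may be shorter than min_tokens.
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six. Seven eight nine."
    for chunk in chunk_by_tokens(sample, min_tokens=2, max_tokens=4):
        print(chunk)
```

With a real tokenizer, the word counts would be replaced by token counts, which is why the documented configuration names an encoding.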
diff --git a/docs/2.developers/4.user-guide/50.llm-xpack/60.embedders.md b/docs/2.developers/4.user-guide/50.llm-xpack/60.embedders.md
@@ -0,0 +1,104 @@
+---
+title: 'Embedders'
+description: 'Embedders available through the Pathway xpack'
+date: '2025-02-04'
+thumbnail: ''
+tags: ['tutorial', 'embedder']
+keywords: ['LLM', 'GPT', 'OpenAI', 'Gemini', 'LiteLLM', 'Embedder']
+---
+
+# Embedders
+
+When storing a document in a vector store, you compute the embedding vector for the text and store the vector with a reference to the original document. You can then compute the embedding of a query and find the embedded documents closest to the query.
+
+The following embedding wrappers are available through the Pathway xpack:
+
+- [`OpenAIEmbedder`](#openaiembedder) - Embed text with any of OpenAI's embedding models
+- [`LiteLLMEmbedder`](#litellmembedder) - Embed text with any model available through LiteLLM
+- [`SentenceTransformerEmbedder`](#sentencetransformerembedder) - Embed text with any model available through SentenceTransformers (a.k.a. SBERT), maintained by Hugging Face
+- [`GeminiEmbedder`](#geminiembedder) - Embed text with any of Google's available embedding models
+
+## OpenAIEmbedder
+The default model for [`OpenAIEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.OpenAIEmbedder) is `text-embedding-3-small`.
+
+```python
+import os
+import pathway as pw
+from pathway.xpacks.llm.parsers import UnstructuredParser
+from pathway.xpacks.llm.embedders import OpenAIEmbedder
+
+files = pw.io.fs.read(
+    os.environ.get("DATA_DIR"),
+    mode="streaming",
+    format="binary",
+    autocommit_duration_ms=50,
+)
+
+# Parse the documents in the specified directory
+parser = UnstructuredParser(chunking_mode="paged")
+documents = files.select(elements=parser(pw.this.data))
+documents = documents.flatten(pw.this.elements)  # flatten list into multiple rows
+documents = documents.select(text=pw.this.elements[0], metadata=pw.this.elements[1])
+
+# Embed each page of the document
+embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])
+embeddings = documents.select(embedding=embedder(pw.this.text))
+```
+
+## LiteLLMEmbedder
+The model for [`LiteLLMEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.LiteLLMEmbedder) has to be specified during initialization. No default is provided.
+
+```python
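+import os
+
+import pathway as pw
+from pathway.xpacks.llm import embedders
+
+# Read the API key from the environment rather than an undefined name
+embedder = embedders.LiteLLMEmbedder(
+    model="text-embedding-3-small", api_key=os.environ["OPENAI_API_KEY"]
+)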
+# Create a table with one column for the text to embed
+t = pw.debug.table_from_markdown(
+    """
+text_column
+Here is some text
+"""
+)
+res = t.select(ret=embedder(pw.this.text_column))
+```
+
+## SentenceTransformerEmbedder
+The [`SentenceTransformerEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.SentenceTransformerEmbedder) lets you use models from the Hugging Face Sentence Transformers library.
+
+The model is specified during initialization. Here is a list of [`available models`](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html).
+
+```python
+import pathway as pw
+from pathway.xpacks.llm import embedders
+
+embedder = embedders.SentenceTransformerEmbedder(model="intfloat/e5-large-v2")
+
+# Create a table with text to embed
+t = pw.debug.table_from_markdown('''
+txt
+Some text to embed
+''')
+
+# Compute the embedding of each row's text
+t.select(ret=embedder(pw.this.txt))
+```
+
+## GeminiEmbedder
+[`GeminiEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders/#pathway.xpacks.llm.embedders.GeminiEmbedder) is the embedder for Google's Gemini embedding services. Available models can be found [`here`](https://ai.google.dev/gemini-api/docs/models/gemini#text-embedding-and-embedding).
+
+```python
+import pathway as pw
+from pathway.xpacks.llm import embedders
+
+embedder = embedders.GeminiEmbedder(model="models/text-embedding-004")
+
+# Create a table with a column for the text to embed
+t = pw.debug.table_from_markdown('''
+txt
+Some text to embed
+''')
+
+t.select(ret=embedder(pw.this.txt))
+```
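Note on `60.embedders.md`: the introduction says you embed documents, embed the query, and then look for the embedded documents closest to the query. A minimal sketch of that lookup follows. The `toy_embed` function is a hypothetical stand-in (word counts) for a real embedder such as `OpenAIEmbedder`, and the helper names are illustrative only, not part of Pathway.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedder: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_documents(query, documents, k=1):
    """Return the k documents whose vectors are closest to the query's."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(d)), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


if __name__ == "__main__":
    docs = ["John drinks coffee", "Someone drinks tea", "Nobody drinks coca-cola"]
    print(closest_documents("Does John like coffee", docs, k=1))
```

A real embedder returns dense vectors (for example of dimension 1536), but the ranking step is the same: score every stored vector against the query vector and keep the best matches.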
diff --git a/...4.user-guide/50.llm-xpack/20.llm-chats.md → ...4.user-guide/50.llm-xpack/70.llm-chats.md b/...4.user-guide/50.llm-xpack/20.llm-chats.md → ...4.user-guide/50.llm-xpack/70.llm-chats.md
@@ -1,6 +1,6 @@
 ---
 title: 'LLM Chats'
-description: 'LLM Wrappers and embedders available through Pathway xpack'
+description: 'LLM Wrappers available through Pathway xpack'
 date: '2025-01-30'
 thumbnail: ''
 tags: ['tutorial', 'LLM', 'LLM Wrappers', 'LLM Chats']
@@ -181,23 +181,10 @@ model = llms.OpenAIChat(
     # if PATHWAY_PERSISTENT_STORAGE is set, then it is used to cache the calls
     cache_strategy=pw.udfs.DefaultCache(),
     # select the model
-    model="gpt-3.5-turbo",
+    model="gpt-4o-mini",
     # read OpenAI API key from environmental variables
     api_key=os.environ["OPENAI_API_KEY"],
 )
 responses = queries.select(result=model(prompt_chat_single_qa(pw.this.questions)))
 pw.debug.compute_and_print(responses)
 ```
-
-
-# Embedders
-
-Pathway also comes with wrappers for embedding models:
-- [`OpenAIEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders#pathway.xpacks.llm.embedders.OpenAIEmbedder)
-- [`LiteLLMEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders#pathway.xpacks.llm.embedders.LiteLLMEmbedder)
-- [`SentenceTransformersEmbedder`](/developers/api-docs/pathway-xpacks-llm/embedders#pathway.xpacks.llm.embedders.SentenceTransformerEmbedder)
-- [`GeminiEmbedder`](https://ai.google.dev/gemini-api/docs/models/gemini#text-embedding-and-embedding)
-
-For more information on the embedders, refer to the Embedding section on the [`Document Indexing`](/developers/user-guide/llm-xpack/docs-indexing) page
-
-
diff --git a/docs/2.developers/4.user-guide/50.llm-xpack/80.rerankers.md b/docs/2.developers/4.user-guide/50.llm-xpack/80.rerankers.md
@@ -0,0 +1,103 @@
+---
+title: 'Rerankers'
+description: 'Rerankers available through the Pathway xpack'
+date: '2025-02-04'
+thumbnail: ''
+tags: ['tutorial', 'reranker']
+keywords: ['LLM', 'Reranker']
+---
+
+# Rerankers
+Rerankers score how relevant each document is to a query or text.
+
+In RAG systems, initial retrieval is often based on the cosine similarity between query and document embeddings. As a result, some retrieved documents may not be relevant to the query. This happens because each passage's meaning is condensed into a single embedding, which can overlook important nuances. This approach is fast, but it can be inaccurate. To improve the quality of the retrieved documents, it is common to refine the initial set with a reranker and keep only the most relevant ones.
+Rerankers reassess and prioritize documents based on their relevance to the query. This is usually done by scoring `(query, document)` pairs, evaluating whether the given `document` contributes to answering the `query`.
+
+Pathway xpack provides the following rerankers:
+- [`LLMReranker`](#llmreranker) - Have an LLM rerank the documents
+- [`CrossEncoderReranker`](#crossencoderreranker) - Rerank with a CrossEncoder from SBERT / SentenceTransformers
+- [`EncoderReranker`](#encoderreranker) - Rerank with a SentenceTransformers encoder (measures similarity)
+
+## LLMReranker
+The [`LLMReranker`](/developers/api-docs/pathway-xpacks-llm/rerankers#pathway.xpacks.llm.rerankers.LLMReranker) asks the provided LLM to rate the relevance of each provided document to the query on a scale of 1 to 5.
+
+```python
+import os
+
+import pandas as pd
+import pathway as pw
+from pathway.xpacks.llm import llms
+from pathway.xpacks.llm import rerankers
+
+docs = [
+    {"text": "John drinks coffee"},
+    {"text": "Someone drinks tea"},
+    {"text": "Nobody drinks coca-cola"},
+]
+
+query = "What does John drink?"
+
+df = pd.DataFrame({"docs": docs, "prompt": query})
+
+# Read the API key from the environment rather than an undefined name
+chat = llms.OpenAIChat(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
+reranker = rerankers.LLMReranker(llm=chat)
+
+input = pw.debug.table_from_pandas(df)
+res = input.select(rank=reranker(pw.this.docs["text"], pw.this.prompt))
+```
+
+## CrossEncoderReranker
+The [`CrossEncoderReranker`](/developers/api-docs/pathway-xpacks-llm/rerankers#pathway.xpacks.llm.rerankers.CrossEncoderReranker) works on text pairs and computes a score between 0 and 1 (or the raw logits if no activation function is passed). The score indicates how relevant the document is to the query.
+More information can be found [`here`](https://www.sbert.net/docs/cross_encoder/pretrained_models.html).
+
+
+```python
+import pandas as pd
+import pathway as pw
+import torch
+from pathway.xpacks.llm import rerankers
+
+docs = [
+    {"text": "John drinks coffee"},
+    {"text": "Someone drinks tea"},
+    {"text": "Nobody drinks coca-cola"},
+]
+
+query = "What does John drink?"
+
+df = pd.DataFrame({"docs": docs, "prompt": query})
+
+reranker = rerankers.CrossEncoderReranker(
+    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
+    default_activation_function=torch.nn.Sigmoid(), # Make outputs between 0..1
+)
+
+input = pw.debug.table_from_pandas(df)
+res = input.select(
+    rank=reranker(pw.this.docs["text"], pw.this.prompt), text=pw.this.docs["text"]
+)
+pw.debug.compute_and_print(res)
+```
+
+## EncoderReranker
+The [`EncoderReranker`](/developers/api-docs/pathway-xpacks-llm/rerankers#pathway.xpacks.llm.rerankers.EncoderReranker) computes the relevance of the query to the supplied documents using the [`SentenceTransformer encoders`](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html).
+
+```python
+import pandas as pd
+import pathway as pw
+from pathway.xpacks.llm import rerankers
+
+docs = [
+    {"text": "John drinks coffee"},
+    {"text": "Someone drinks tea"},
+    {"text": "Nobody drinks coca-cola"},
+]
+
+query = "What does John drink?"
+
+df = pd.DataFrame({"docs": docs, "prompt": query})
+
+reranker = rerankers.EncoderReranker(
+    model_name="all-mpnet-base-v2",
+)
+
+input = pw.debug.table_from_pandas(df)
+res = input.select(
+    rank=reranker(pw.this.docs["text"], pw.this.prompt), text=pw.this.docs["text"]
+)
+```
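Note on `80.rerankers.md`: the page explains that a reranker scores `(query, document)` pairs and that an activation function such as a sigmoid maps raw logits into the 0..1 range. The sketch below shows that flow in plain Python. The scorer `toy_logit` is a hypothetical word-overlap heuristic standing in for a real cross-encoder, and `rerank` is an illustrative helper, not a Pathway function.

```python
import math


def sigmoid(logit):
    """Map a raw logit to a score strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


def toy_logit(query, document):
    """Hypothetical scorer: number of shared lower-cased words, minus 1."""
    q = set(query.lower().replace("?", "").split())
    d = set(document.lower().split())
    return len(q & d) - 1.0


def rerank(query, documents, top_k=2):
    """Score every (query, document) pair and keep the top_k best."""
    scored = [(sigmoid(toy_logit(query, doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    docs = ["John drinks coffee", "Someone drinks tea", "Nobody drinks coca-cola"]
    for score, doc in rerank("What does John drink?", docs):
        print(f"{score:.3f}  {doc}")
```

In a pipeline, this step would run on the documents returned by the initial retrieval, so that only the highest-scoring ones are passed on to the LLM.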
diff --git a/...-guide/50.llm-xpack/50.unstructed-data.md → ...-guide/50.llm-xpack/90.unstructed-data.md b/...-guide/50.llm-xpack/50.unstructed-data.md → ...-guide/50.llm-xpack/90.unstructed-data.md
diff --git a/python/pathway/xpacks/llm/embedders.py b/python/pathway/xpacks/llm/embedders.py
@@ -119,7 +119,7 @@ class OpenAIEmbedder(BaseEmbedder):
 
     >>> import pathway as pw
     >>> from pathway.xpacks.llm import embedders
-    >>> embedder = embedders.OpenAIEmbedder(model="text-embedding-ada-002")
+    >>> embedder = embedders.OpenAIEmbedder(model="text-embedding-3-small")
     >>> t = pw.debug.table_from_markdown('''
     ... txt
     ... Text
@@ -132,7 +132,7 @@ class OpenAIEmbedder(BaseEmbedder):
     >>> embedder = embedders.OpenAIEmbedder()
     >>> t = pw.debug.table_from_markdown('''
     ... txt  | model
-    ... Text | text-embedding-ada-002
+    ... Text | text-embedding-3-small
     ... ''')
     >>> t.select(ret=embedder(pw.this.txt, model=pw.this.model))
     <pathway.Table schema={'ret': numpy.ndarray[typing.Any, numpy.dtype[typing.Any]]}>
@@ -144,7 +144,7 @@ def __init__(
         capacity: int | None = None,
         retry_strategy: udfs.AsyncRetryStrategy | None = None,
         cache_strategy: udfs.CacheStrategy | None = None,
-        model: str | None = "text-embedding-ada-002",
+        model: str | None = "text-embedding-3-small",
         **openai_kwargs,
     ):
         with optional_imports("xpack-llm"):
@@ -211,7 +211,7 @@ class LiteLLMEmbedder(BaseEmbedder):
 
     >>> import pathway as pw
     >>> from pathway.xpacks.llm import embedders
-    >>> embedder = embedders.LiteLLMEmbedder(model="text-embedding-ada-002")
+    >>> embedder = embedders.LiteLLMEmbedder(model="text-embedding-3-small")
     >>> t = pw.debug.table_from_markdown('''
     ... txt
     ... Text
@@ -224,7 +224,7 @@ class LiteLLMEmbedder(BaseEmbedder):
     >>> embedder = embedders.LiteLLMEmbedder()
     >>> t = pw.debug.table_from_markdown('''
     ... txt  | model
-    ... Text | text-embedding-ada-002
+    ... Text | text-embedding-3-small
     ... ''')
     >>> t.select(ret=embedder(pw.this.txt, model=pw.this.model))
     <pathway.Table schema={'ret': numpy.ndarray[typing.Any, numpy.dtype[typing.Any]]}>