Commit
Showing 3 changed files with 218 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,17 @@ | ||
.. _Ref-Query: | ||
|
||
Query | ||
=========== | ||
============== | ||
|
||
.. automodule:: sycamore.query | ||
This package allows you to build sophisticated LLM query-powered pipelines using Sycamore. | ||
|
||
.. automodule:: sycamore.query.client | ||
:members: | ||
|
||
.. automodule:: sycamore.query.planner | ||
:members: | ||
|
||
.. automodule:: sycamore.query.logical_plan | ||
:members: | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,205 @@ | ||
Querying Data with Sycamore
============================

Beyond document processing and loading data stores, Sycamore can also be used
to implement sophisticated query pipelines over unstructured, semi-structured,
and structured data. These query pipelines can use the full range of Sycamore operations
to transform, filter, and aggregate data. This provides a powerful
abstraction that goes beyond conventional query languages (like SQL) and LLM-based data
retrieval techniques (like RAG).

Overview
--------

Sycamore Query consists of a few components, which are found in the
``sycamore.query`` package, :doc:`documented here </sycamore/APIs/query>`.

The ``SycamoreQueryClient`` class is the primary interface to the Sycamore
Query engine. It is configured with a pointer to an underlying data source
(currently, an OpenSearch index). The ``SycamoreQueryClient.query()`` method
takes a natural-language query, runs it against the data source, and returns
a Sycamore ``DocSet`` as a result.

Here is a simple example:

.. code-block:: python

    from sycamore.query.client import SycamoreQueryClient

    # If no data source is specified, the OpenSearch server running on localhost:9200 is used.
    client = SycamoreQueryClient()

    # Generate a query plan and run it for the given OpenSearch index.
    result = client.query(
        query="How many incidents were reported in bad weather conditions?",
        index="const_ntsb")

..

Under the hood, an LLM generates a *query plan* from the natural-language
query. The plan consists of a pipeline of Sycamore operators
that retrieve, transform, and aggregate data. One can also use the
``LogicalPlan`` class to build query plans directly, without the help of an LLM.

Sycamore Query supports a broad range of operators, including
filtering, aggregation, group-by, count, sorting, and mathematical operations.
It also supports LLM-powered query operators, such as
``LlmFilter``, which uses an LLM to filter data, and ``SummarizeData``, which
takes a Sycamore ``DocSet`` as input and uses the LLM to produce a natural-language
response given a prompt.

One can think of Sycamore Query as a more sophisticated form of RAG that
combines Sycamore's data-processing operations with the
semantic power of an LLM, allowing you to run queries that a RAG system
typically cannot answer reliably.

The Sycamore Query UI
---------------------

The directory ``apps/query-ui`` in the Sycamore tree contains a web-based UI for
Sycamore Query, making it easy to experiment with queries, inspect query plans,
and debug the results. To start it, run the following in a checkout of
the Sycamore source tree:

.. code-block:: bash

    cd apps/query-ui
    poetry install
    poetry run queryui/main.py

..

By default, the UI will query OpenSearch running locally on port 9200.

Sycamore Query Plans
--------------------

The query plans generated by Sycamore Query are instances of the ``LogicalPlan``
class and represent a tree of operators that fetch, filter, aggregate, or process data
to produce a final result. You can inspect the query plan generated for a given query
in the UI (as described above) or by using the ``SycamoreQueryClient.generate_plan()``
method.

For example, a query plan for a query such as
**"What is the breakdown of aircraft types for incidents with substantial damage?"**
might be as follows:

.. code-block:: python

    {
        "nodes": {
            "0": QueryDatabase(
                node_id=0,
                description="Get all the incident reports with substantial aircraft damage",
                input=None,
                index="const_ntsb",
                query={"match": {"properties.entity.aircraftDamage": "Substantial"}}
            ),
            "1": TopK(
                node_id=1,
                description="Get the breakdown of aircraft types",
                input=[0],
                field="properties.entity.aircraft",
                primary_field="properties.entity.accidentNumber",
                K=100,
                descending=False)
        }
    }

..

In essence, this plan queries OpenSearch for records matching the given OpenSearch query
and feeds the results to a ``TopK`` operation that performs a group-by on the
aircraft type.

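To make the data flow of this two-node plan concrete, here is a plain-Python analogue. It is only a sketch and not how Sycamore executes plans: the sample records, the flattened field names, and the helper functions ``match_damage`` and ``breakdown_by_aircraft`` are all made up for illustration, and exact string equality stands in for the OpenSearch ``match`` query.

```python
from collections import Counter

# Made-up sample records; real documents carry many more properties.
incidents = [
    {"accidentNumber": "A1", "aircraft": "Cessna 172", "aircraftDamage": "Substantial"},
    {"accidentNumber": "A2", "aircraft": "Piper PA-28", "aircraftDamage": "Minor"},
    {"accidentNumber": "A3", "aircraft": "Cessna 172", "aircraftDamage": "Substantial"},
    {"accidentNumber": "A4", "aircraft": "Boeing 737", "aircraftDamage": "Substantial"},
]


def match_damage(records, damage):
    """Hypothetical stand-in for node 0: keep records with the given damage level."""
    return [r for r in records if r["aircraftDamage"] == damage]


def breakdown_by_aircraft(records):
    """Hypothetical stand-in for node 1: group by aircraft type and count each group."""
    return dict(Counter(r["aircraft"] for r in records))


# Node 1 consumes the output of node 0, mirroring input=[0] in the plan.
breakdown = breakdown_by_aircraft(match_damage(incidents, "Substantial"))
print(breakdown)  # {'Cessna 172': 2, 'Boeing 737': 1}
```

The point of the sketch is the shape of the pipeline: each node consumes the output of the nodes listed in its ``input`` field, so the final result is the output of the last node in the chain.
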
``SycamoreQueryClient`` uses an LLM and knowledge of the OpenSearch
index schema to generate these query plans automatically from natural-language queries.
However, you can also construct a ``LogicalPlan`` directly, in code, and pass it to
``SycamoreQueryClient.run_plan()`` to run it.

Caching and performance
-----------------------

If you are running multiple queries that use the same intermediate results,
Sycamore Query can cache those intermediate results to avoid recomputing them
for subsequent queries. This reduces both query latency and LLM cost.

To use this feature, pass the ``cache_dir`` option to ``SycamoreQueryClient``:

.. code-block:: python

    client = SycamoreQueryClient(cache_dir="/path/to/cache/dir")

..

Intermediate query results will be written to this directory and reused for
subsequent queries using the same ``cache_dir`` setting. If you wish to invalidate
the cache, simply remove the contents of your ``cache_dir``.

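If you prefer to invalidate the cache from code, the following is a minimal sketch using only the Python standard library. It assumes ``cache_dir`` is a local filesystem path; the helper name ``clear_cache_dir`` is hypothetical and is not part of Sycamore.

```python
import shutil
from pathlib import Path


def clear_cache_dir(cache_dir):
    """Delete everything inside cache_dir but keep the directory itself.

    Returns the number of top-level entries removed (0 if the directory
    does not exist).
    """
    root = Path(cache_dir)
    if not root.is_dir():
        return 0
    removed = 0
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a cached subdirectory and its contents
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed


# Example call (the path is a placeholder):
# clear_cache_dir("/path/to/cache/dir")
```

Keeping the directory itself in place means a client configured with the same ``cache_dir`` can keep writing to it after the cache has been cleared.
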
Sycamore Query can also cache the results of LLM calls, saving time and money when
many LLM operations are being performed. To use this feature, pass the
``s3_cache_path`` option to ``SycamoreQueryClient``:

.. code-block:: python

    client = SycamoreQueryClient(s3_cache_path="/path/to/llm_cache/dir")

..

(Note that the name of this option is a misnomer; it need not be an S3 path.)

The ``cache_dir`` and ``s3_cache_path`` settings can be either local filesystem
paths or locations of S3 buckets (e.g., ``s3://your-bucket/query-cache``).

Debugging query execution
-------------------------

Sycamore Query will write the output of each node of the query plan as it runs
to a trace directory that you specify, allowing you to inspect the results as they
flow through the query plan. To use this feature, pass the ``trace_dir`` option
to ``SycamoreQueryClient``:

.. code-block:: python

    client = SycamoreQueryClient(trace_dir="/path/to/trace/dir")

..

The ``trace_dir`` will be populated with files containing
the output of each query operator (note that these can be quite large, depending
on the amount of data you are querying). The layout of the directory will be:

.. code-block:: text

    trace_dir/
        <query_id>/
            <node_id>/
                doc-<doc_uuid_1>.pickle
                doc-<doc_uuid_2>.pickle
                ...

..

where ``<query_id>`` is a unique ID representing the query that was executed,
``<node_id>`` is the node ID in the query plan, and each ``<doc_uuid_N>`` is a unique
ID for a document in the ``DocSet`` that was emitted by that node in the query plan.

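A quick way to see how much data each node emitted is to count the files in this layout before loading any of them. The sketch below relies only on the directory structure described above and on the Python standard library; the helper name ``count_trace_docs`` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def count_trace_docs(trace_dir):
    """Return {query_id: {node_id: number of doc-*.pickle files}}."""
    counts = defaultdict(dict)
    # Entries two levels below trace_dir are the <query_id>/<node_id> directories.
    for node_dir in sorted(Path(trace_dir).glob("*/*")):
        if not node_dir.is_dir():
            continue
        query_id = node_dir.parent.name
        n_docs = sum(1 for _ in node_dir.glob("doc-*.pickle"))
        counts[query_id][node_dir.name] = n_docs
    return dict(counts)


# Example call (the path is a placeholder):
# count_trace_docs("/path/to/trace/dir")
```

A node whose count drops to zero is often the first place to look when a query returns an unexpectedly empty result.
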
These are Python pickle files containing the contents of each Sycamore ``Document`` emitted by
the corresponding query node. You can read back the files for a single query with code like the following:

.. code-block:: python

    import os

    from sycamore.data import Document

    # Per the layout above, node directories live under trace_dir/<query_id>.
    # Both path components are placeholders; substitute your own values.
    query_dir = os.path.join("/path/to/trace/dir", "<query_id>")

    # Map each node ID to the Documents emitted by that node.
    docs = {}
    for node_id in os.listdir(query_dir):
        docs[node_id] = []
        for filename in os.listdir(os.path.join(query_dir, node_id)):
            path = os.path.join(query_dir, node_id, filename)
            with open(path, "rb") as file:
                # Read from the open file handle, not from the path string.
                doc = Document.deserialize(file.read())
            docs[node_id].append(doc)

..