Add query documentation.
mdwelsh committed Oct 5, 2024
1 parent e1ea7bb commit 154f7d5
Showing 3 changed files with 218 additions and 2 deletions.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -120,5 +120,6 @@ More Resources
/sycamore/using_jupyter.md
/sycamore/transforms.rst
/sycamore/connectors.rst
/sycamore/query.rst
/sycamore/tutorials.rst
/sycamore/APIs.rst
14 changes: 12 additions & 2 deletions docs/source/sycamore/APIs/query.rst
@@ -1,7 +1,17 @@
.. _Ref-Query:

Query
==============

This package allows you to build sophisticated LLM query-powered pipelines using Sycamore.

.. automodule:: sycamore.query.client
   :members:

.. automodule:: sycamore.query.planner
   :members:

.. automodule:: sycamore.query.logical_plan
   :members:


205 changes: 205 additions & 0 deletions docs/source/sycamore/query.rst
@@ -0,0 +1,205 @@
Querying Data with Sycamore
============================

Beyond using Sycamore for document processing and loading data stores, Sycamore can also
be used to implement sophisticated query pipelines over unstructured, semi-structured,
and structured data. These query pipelines can use the full range of Sycamore operations
to transform, filter, and aggregate data in a variety of ways. This provides a powerful
abstraction that goes beyond conventional query languages (like SQL) and LLM-based data
retrieval techniques (like RAG).

Overview
--------

Sycamore Query consists of a few components, which are found in the
``sycamore.query`` package, :doc:`documented here </sycamore/APIs/query>`.

The ``SycamoreQueryClient`` class is the primary interface to the Sycamore
Query engine. It is configured with a pointer to an underlying data source
(currently, an OpenSearch index). The ``SycamoreQueryClient.query()`` method
allows one to query the data source using a natural-language query, getting
a Sycamore ``DocSet`` as a result.

Here is a simple example:

.. code-block:: python

    from sycamore.query.client import SycamoreQueryClient

    # If no data source is specified, the OpenSearch server running on localhost:9200 is used.
    client = SycamoreQueryClient()

    # Generate a query plan and run it for the given OpenSearch index.
    result = client.query(
        query="How many incidents were reported in bad weather conditions?",
        index="const_ntsb")
Under the hood, this is using an LLM to generate a *query plan* from the
natural-language query, which consists of a pipeline of Sycamore operators
that retrieve, transform, and aggregate data. One can also use the
``LogicalPlan`` class to build query plans directly, without the help of an LLM.

The range of operators supported by Sycamore Query is quite broad, including
filtering, aggregation, group-by, count, sorting, and mathematical operations.
Sycamore Query also supports LLM-powered query operators, such as
``LlmFilter``, which uses an LLM to filter data, and ``SummarizeData``, which
takes a Sycamore DocSet as input, and uses the LLM to produce a natural language
response given a prompt.
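To make the idea concrete, here is a minimal, purely illustrative sketch of an ``LlmFilter``-style operator in plain Python. The function names and the keyword-matching ``stub_judge`` are stand-ins for a real LLM call; none of this is the Sycamore API.

```python
# Conceptual sketch of an LlmFilter-style operator (NOT Sycamore's API).
# A real implementation would ask an LLM a yes/no question about each
# document; here a simple keyword check stands in for that judgment.
def llm_filter(docs, question, judge):
    """Keep only documents for which the judge answers 'yes' to the question."""
    return [doc for doc in docs if judge(question, doc)]

def stub_judge(question, doc):
    # Stand-in for an LLM call: "was the weather a factor?"
    return "fog" in doc["text"] or "storm" in doc["text"]

reports = [
    {"id": 1, "text": "Pilot reported dense fog on approach."},
    {"id": 2, "text": "Engine failure in clear skies."},
    {"id": 3, "text": "Thunderstorm forced an emergency landing."},
]
kept = llm_filter(reports, "Was weather a factor?", stub_judge)
print([d["id"] for d in kept])  # → [1, 3]
```

The real operator applies the same shape of computation across a distributed ``DocSet``, with the LLM acting as the predicate.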

One can think of Sycamore Query as a much more sophisticated form of RAG,
using the power of Sycamore's data-processing operations alongside the
semantic power of an LLM, allowing you to run queries that would be
impossible for a RAG system to answer reliably.

The Sycamore Query UI
---------------------

The directory ``apps/query-ui`` in the Sycamore tree contains a web-based UI to
Sycamore Query, making it easy to experiment with queries, inspect query plans,
and debug the results. To run it, simply run the following in a checkout of
the Sycamore source tree:

.. code-block:: bash

    cd apps/query-ui
    poetry install
    poetry run queryui/main.py
By default, the UI will query OpenSearch running locally on port 9200.

Sycamore Query Plans
--------------------

The query plans generated by Sycamore Query are instances of the ``LogicalPlan``
class and represent a tree of operators that fetch, filter, aggregate, or process data
to produce a final result. You can inspect the query plan generated for a given query
in the UI (as described above) or by using the ``SycamoreQueryClient.generate_plan()``
method.

For example, a query plan for a query such as
**"What is the breakdown of aircraft types for incidents with substantial damage?"**
might be as follows:

.. code-block:: python

    {
        "nodes": {
            "0": QueryDatabase(
                node_id=0,
                description="Get all the incident reports with substantial aircraft damage",
                input=None,
                index="const_ntsb",
                query={"match": {"properties.entity.aircraftDamage": "Substantial"}}
            ),
            "1": TopK(
                node_id=1,
                description="Get the breakdown of aircraft types",
                input=[0],
                field="properties.entity.aircraft",
                primary_field="properties.entity.accidentNumber",
                K=100,
                descending=False)
        }
    }
Essentially, this plan queries OpenSearch for records matching the given OpenSearch query
and feeds the results to a ``TopK`` operation that performs a group-by on the
aircraft type.
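In plain Python (illustrative only, not Sycamore code, and with made-up field names), the ``TopK`` step amounts to a count-based group-by followed by taking the top K groups:

```python
from collections import Counter

# Records as they might come back from the QueryDatabase node
# (field names and values are illustrative only).
records = [
    {"aircraft": "Cessna 172", "accidentNumber": "A1"},
    {"aircraft": "Piper PA-28", "accidentNumber": "A2"},
    {"aircraft": "Cessna 172", "accidentNumber": "A3"},
    {"aircraft": "Cessna 172", "accidentNumber": "A4"},
    {"aircraft": "Piper PA-28", "accidentNumber": "A5"},
    {"aircraft": "Beech 35", "accidentNumber": "A6"},
]

# Group by aircraft type, count each group, and keep the top K groups.
K = 100
counts = Counter(r["aircraft"] for r in records)
breakdown = counts.most_common(K)
print(breakdown)  # → [('Cessna 172', 3), ('Piper PA-28', 2), ('Beech 35', 1)]
```

The plan's ``descending`` flag controls the sort direction of the resulting groups.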

``SycamoreQueryClient`` uses an LLM and knowledge of the OpenSearch
index schema to generate these query plans automatically from natural-language queries.
However, you can also construct a ``LogicalPlan`` directly, in code, and pass it to
``SycamoreQueryClient.run_plan()`` to run it.
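As a mental model (purely illustrative; this is not Sycamore's implementation, and real plan nodes take lists of inputs), running a plan is a bottom-up evaluation of the operator tree, with each node consuming its upstream node's output:

```python
# Tiny stand-in for plan execution: each node consumes its input's output.
def run_plan(nodes):
    """Evaluate nodes in id order; each node's `input` names its upstream node."""
    results = {}
    for node_id in sorted(nodes):
        node = nodes[node_id]
        upstream = results[node["input"]] if node["input"] is not None else None
        results[node_id] = node["fn"](upstream)
    return results[max(nodes)]

plan = {
    0: {"input": None, "fn": lambda _: [3, 1, 4, 1, 5]},         # "fetch" from the data source
    1: {"input": 0, "fn": lambda xs: [x for x in xs if x > 1]},  # "filter"
    2: {"input": 1, "fn": len},                                  # "count"
}
print(run_plan(plan))  # → 3
```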


Caching and performance
-----------------------

If you are running multiple queries that use the same intermediate results,
Sycamore Query can cache those intermediate results to avoid recomputing them
for subsequent queries. This is helpful from a performance and LLM cost perspective.

To use this feature, pass the ``cache_dir`` option to ``SycamoreQueryClient``:

.. code-block:: python

    client = SycamoreQueryClient(cache_dir="/path/to/cache/dir")
Intermediate query results will be written to this directory and reused for
subsequent queries using the same ``cache_dir`` setting. If you wish to invalidate
the cache, simply remove the contents of your ``cache_dir``.
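The mechanism can be pictured as a content-keyed on-disk memo table: results are stored under a hash of what produced them, and a later run with the same key reads the file instead of recomputing. A minimal sketch in plain Python (illustrative only, not Sycamore's implementation):

```python
import hashlib
import os
import pickle
import tempfile

def cached(cache_dir, key, compute):
    """Return the cached result for `key`, computing and storing it on a miss."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    path = os.path.join(cache_dir, digest + ".pickle")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # cache hit: skip the computation
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)     # cache miss: store for next time
    return result

cache_dir = tempfile.mkdtemp()
calls = []
def expensive():
    calls.append(1)  # track how many times we actually compute
    return 42

assert cached(cache_dir, "plan-node-0", expensive) == 42
assert cached(cache_dir, "plan-node-0", expensive) == 42
print(len(calls))  # → 1  (the second call was served from the cache)
```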

Sycamore Query can also cache the results of LLM calls, saving time and money when
many LLM operations are being performed. To use this feature, pass the
``s3_cache_path`` option to ``SycamoreQueryClient``.

.. code-block:: python

    client = SycamoreQueryClient(s3_cache_path="/path/to/llm_cache/dir")
(Note that the name of this flag is a misnomer; it need not be an S3 path.)

The ``cache_dir`` and ``s3_cache_path`` settings can either be local filesystem
paths, or locations of S3 buckets (e.g., ``s3://your-bucket/query-cache``).

Debugging query execution
-------------------------

Sycamore Query will write the output of each node of the query plan as it runs
to a trace directory that you specify, allowing you to inspect the results as they
flow through the query plan. To use this feature, pass the ``trace_dir``
to ``SycamoreQueryClient``:

.. code-block:: python

    client = SycamoreQueryClient(trace_dir="/path/to/trace/dir")
The contents of the ``trace_dir`` will be populated with files containing
the output of each query operator (note that these can be quite large, depending
on the amount of data you are querying). The layout of the directory will be:

.. code-block:: text

    trace_dir/
        <query_id>/
            <node_id>/
                doc-<doc_uuid_1>.pickle
                doc-<doc_uuid_2>.pickle
                ...
where ``<query_id>`` is a unique ID representing the query that was executed,
``<node_id>`` is the node ID in the query plan, and ``<doc_uuid_NNNN>`` is a unique
ID for each document in the ``DocSet`` that was emitted by that node in the query plan.

These are Python pickle files containing the contents of each Sycamore ``Document`` emitted by
the corresponding query node. You can read them back with code like the following:

.. code-block:: python

    import os
    from sycamore.data import Document

    # Note: trace_dir here should point at the directory for a single query,
    # i.e., <trace_dir>/<query_id> in the layout shown above.
    docs = {}
    for node_id in os.listdir(trace_dir):
        docs[node_id] = []
        for filename in os.listdir(os.path.join(trace_dir, node_id)):
            f = os.path.join(trace_dir, node_id, filename)
            with open(f, "rb") as file:
                doc = Document.deserialize(file.read())
                docs[node_id].append(doc)

