Add query documentation.
mdwelsh committed Oct 5, 2024
1 parent e1ea7bb commit 154f7d5
Showing 3 changed files with 218 additions and 2 deletions.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -120,5 +120,6 @@ More Resources
/sycamore/using_jupyter.md
/sycamore/transforms.rst
/sycamore/connectors.rst
/sycamore/query.rst
/sycamore/tutorials.rst
/sycamore/APIs.rst
14 changes: 12 additions & 2 deletions docs/source/sycamore/APIs/query.rst
@@ -1,7 +1,17 @@
.. _Ref-Query:

Query
==============

This package allows you to build sophisticated LLM query-powered pipelines using Sycamore.

.. automodule:: sycamore.query.client
   :members:

.. automodule:: sycamore.query.planner
   :members:

.. automodule:: sycamore.query.logical_plan
   :members:


205 changes: 205 additions & 0 deletions docs/source/sycamore/query.rst
@@ -0,0 +1,205 @@
Querying Data with Sycamore
============================

Beyond using Sycamore for document processing and loading data stores, Sycamore can also
be used to implement sophisticated query pipelines over unstructured, semi-structured,
and structured data. These query pipelines can use the full range of Sycamore operations
to transform, filter, and aggregate data in a variety of ways. This provides a powerful
abstraction that goes beyond conventional query languages (like SQL) and LLM-based data
retrieval techniques (like RAG).

Overview
--------

Sycamore Query consists of a few components, which are found in the
``sycamore.query`` package, :doc:`documented here </sycamore/APIs/query>`.

The ``SycamoreQueryClient`` class is the primary interface to the Sycamore
Query engine. It is configured with a pointer to an underlying data source
(currently, an OpenSearch index). The ``SycamoreQueryClient.query()`` method
allows one to query the data source using a natural-language query, getting
a Sycamore ``DocSet`` as a result.

Here is a simple example:

.. code-block:: python

    from sycamore.query.client import SycamoreQueryClient

    # If no data source is specified, the OpenSearch server running on localhost:9200 is used.
    client = SycamoreQueryClient()

    # Generate a query plan and run it for the given OpenSearch index.
    result = client.query(
        query="How many incidents were reported in bad weather conditions?",
        index="const_ntsb")
Under the hood, this is using an LLM to generate a *query plan* from the
natural-language query, which consists of a pipeline of Sycamore operators
that retrieve, transform, and aggregate data. One can also use the
``LogicalPlan`` class to build query plans directly, without the help of an LLM.

The range of operators supported by Sycamore Query is quite broad, including
filtering, aggregation, group-by, count, sorting, and mathematical operations.
Sycamore Query also supports LLM-powered query operators, such as
``LlmFilter``, which uses an LLM to filter data, and ``SummarizeData``, which
takes a Sycamore DocSet as input, and uses the LLM to produce a natural language
response given a prompt.
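To make the idea concrete, here is a minimal, purely illustrative sketch of an ``LlmFilter``-style operator in plain Python. The function names and the keyword-matching ``stub_judge`` are stand-ins for a real LLM call; none of this is the Sycamore API.

```python
# Conceptual sketch of an LlmFilter-style operator (NOT Sycamore's API).
# A real implementation would ask an LLM a yes/no question about each
# document; here a simple keyword check stands in for that judgment.
def llm_filter(docs, question, judge):
    """Keep only documents for which the judge answers 'yes' to the question."""
    return [doc for doc in docs if judge(question, doc)]

def stub_judge(question, doc):
    # Stand-in for an LLM call: "was the weather a factor?"
    return "fog" in doc["text"] or "storm" in doc["text"]

reports = [
    {"id": 1, "text": "Pilot reported dense fog on approach."},
    {"id": 2, "text": "Engine failure in clear skies."},
    {"id": 3, "text": "Thunderstorm forced an emergency landing."},
]
kept = llm_filter(reports, "Was weather a factor?", stub_judge)
print([d["id"] for d in kept])  # → [1, 3]
```

The real operator applies the same shape of computation across a distributed ``DocSet``, with the LLM acting as the predicate.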

One can think of Sycamore Query as a much more sophisticated form of RAG,
using the power of Sycamore's data-processing operations alongside the
semantic power of an LLM, allowing you to run queries that would be
impossible for a RAG system to answer reliably.

The Sycamore Query UI
---------------------

The directory ``apps/query-ui`` in the Sycamore tree contains a web-based UI to
Sycamore Query, making it easy to experiment with queries, inspect query plans,
and debug the results. To run it, simply run the following in a checkout of
the Sycamore source tree:

.. code-block:: bash

    cd apps/query-ui
    poetry install
    poetry run queryui/main.py
By default, the UI will query OpenSearch running locally on port 9200.

Sycamore Query Plans
--------------------

The query plans generated by Sycamore Query are instances of the ``LogicalPlan``
class and represent a tree of operators that fetch, filter, aggregate, or process data
to produce a final result. You can inspect the query plan generated for a given query
in the UI (as described above) or by using the ``SycamoreQueryClient.generate_plan()``
method.

For example, a query plan for a query such as
**"What is the breakdown of aircraft types for incidents with substantial damage?"**
might be as follows:

.. code-block:: python

    {
        "nodes": {
            "0": QueryDatabase(
                node_id=0,
                description="Get all the incident reports with substantial aircraft damage",
                input=None,
                index="const_ntsb",
                query={"match": {"properties.entity.aircraftDamage": "Substantial"}}
            ),
            "1": TopK(
                node_id=1,
                description="Get the breakdown of aircraft types",
                input=[0],
                field="properties.entity.aircraft",
                primary_field="properties.entity.accidentNumber",
                K=100,
                descending=False)
        }
    }
Essentially, this plan queries OpenSearch for records matching the given OpenSearch query
and feeds the results to a ``TopK`` operation that performs a group-by on the
aircraft type.
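In plain Python (illustrative only, not Sycamore code, and with made-up field names), the ``TopK`` step amounts to a count-based group-by followed by taking the top K groups:

```python
from collections import Counter

# Records as they might come back from the QueryDatabase node
# (field names and values are illustrative only).
records = [
    {"aircraft": "Cessna 172", "accidentNumber": "A1"},
    {"aircraft": "Piper PA-28", "accidentNumber": "A2"},
    {"aircraft": "Cessna 172", "accidentNumber": "A3"},
    {"aircraft": "Cessna 172", "accidentNumber": "A4"},
    {"aircraft": "Piper PA-28", "accidentNumber": "A5"},
    {"aircraft": "Beech 35", "accidentNumber": "A6"},
]

# Group by aircraft type, count each group, and keep the top K groups.
K = 100
counts = Counter(r["aircraft"] for r in records)
breakdown = counts.most_common(K)
print(breakdown)  # → [('Cessna 172', 3), ('Piper PA-28', 2), ('Beech 35', 1)]
```

The plan's ``descending`` flag controls the sort direction of the resulting groups.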

``SycamoreQueryClient`` uses an LLM and knowledge of the OpenSearch
index schema to generate these query plans automatically from natural-language queries.
However, you can also construct a ``LogicalPlan`` directly, in code, and pass it to
``SycamoreQueryClient.run_plan()`` to run it.
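As a mental model (purely illustrative; this is not Sycamore's implementation, and real plan nodes take lists of inputs), running a plan is a bottom-up evaluation of the operator tree, with each node consuming its upstream node's output:

```python
# Tiny stand-in for plan execution: each node consumes its input's output.
def run_plan(nodes):
    """Evaluate nodes in id order; each node's `input` names its upstream node."""
    results = {}
    for node_id in sorted(nodes):
        node = nodes[node_id]
        upstream = results[node["input"]] if node["input"] is not None else None
        results[node_id] = node["fn"](upstream)
    return results[max(nodes)]

plan = {
    0: {"input": None, "fn": lambda _: [3, 1, 4, 1, 5]},         # "fetch" from the data source
    1: {"input": 0, "fn": lambda xs: [x for x in xs if x > 1]},  # "filter"
    2: {"input": 1, "fn": len},                                  # "count"
}
print(run_plan(plan))  # → 3
```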


Caching and performance
-----------------------

If you are running multiple queries that use the same intermediate results,
Sycamore Query can cache those intermediate results to avoid recomputing them
for subsequent queries. This is helpful from a performance and LLM cost perspective.

To use this feature, pass the ``cache_dir`` option to ``SycamoreQueryClient``:

.. code-block:: python

    client = SycamoreQueryClient(cache_dir="/path/to/cache/dir")
Intermediate query results will be written to this directory and reused for
subsequent queries using the same ``cache_dir`` setting. If you wish to invalidate
the cache, simply remove the contents of your ``cache_dir``.
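The mechanism can be pictured as a content-keyed on-disk memo table: results are stored under a hash of what produced them, and a later run with the same key reads the file instead of recomputing. A minimal sketch in plain Python (illustrative only, not Sycamore's implementation):

```python
import hashlib
import os
import pickle
import tempfile

def cached(cache_dir, key, compute):
    """Return the cached result for `key`, computing and storing it on a miss."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    path = os.path.join(cache_dir, digest + ".pickle")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # cache hit: skip the computation
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)     # cache miss: store for next time
    return result

cache_dir = tempfile.mkdtemp()
calls = []
def expensive():
    calls.append(1)  # track how many times we actually compute
    return 42

assert cached(cache_dir, "plan-node-0", expensive) == 42
assert cached(cache_dir, "plan-node-0", expensive) == 42
print(len(calls))  # → 1  (the second call was served from the cache)
```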

Sycamore Query can also cache the results of LLM calls, saving time and money when
many LLM operations are being performed. To use this feature, pass the
``s3_cache_path`` option to ``SycamoreQueryClient``.

.. code-block:: python

    client = SycamoreQueryClient(s3_cache_path="/path/to/llm_cache/dir")
(Note that the name of this flag is a misnomer; it need not be an S3 path.)

The ``cache_dir`` and ``s3_cache_path`` settings can either be local filesystem
paths, or locations of S3 buckets (e.g., ``s3://your-bucket/query-cache``).

Debugging query execution
-------------------------

Sycamore Query will write the output of each node of the query plan as it runs
to a trace directory that you specify, allowing you to inspect the results as they
flow through the query plan. To use this feature, pass the ``trace_dir``
to ``SycamoreQueryClient``:

.. code-block:: python

    client = SycamoreQueryClient(trace_dir="/path/to/trace/dir")
The contents of the ``trace_dir`` will be populated with files containing
the output of each query operator (note that these can be quite large, depending
on the amount of data you are querying). The layout of the directory will be:

.. code-block:: text

    trace_dir/
        <query_id>/
            <node_id>/
                doc-<doc_uuid_1>.pickle
                doc-<doc_uuid_2>.pickle
                ...
where ``<query_id>`` is a unique ID representing the query that was executed,
``<node_id>`` is the node ID in the query plan, and ``<doc_uuid_NNNN>`` is a unique
ID for each document in the ``DocSet`` that was emitted by that node in the query plan.

These are Python pickle files containing the contents of each Sycamore ``Document`` emitted by
the corresponding query node. You can read them back with code like the following:

.. code-block:: python

    import os
    from sycamore.data import Document

    # Note: trace_dir here should point at the directory for a single query,
    # i.e., <trace_dir>/<query_id> in the layout shown above.
    docs = {}
    for node_id in os.listdir(trace_dir):
        docs[node_id] = []
        for filename in os.listdir(os.path.join(trace_dir, node_id)):
            f = os.path.join(trace_dir, node_id, filename)
            with open(f, "rb") as file:
                doc = Document.deserialize(file.read())
                docs[node_id].append(doc)

