Separate view-level and collection-level results (#8)

* Separate view-level and collection-level results
* Add tests for Collection.ask
* Add documentation
deepsense-ai · Apr 10, 2024 · 89062dc
1 parent 74892b9

Showing 22 changed files with 471 additions and 44 deletions.
diff --git a/benchmark/dbally_benchmark/text2sql/metrics.py b/benchmark/dbally_benchmark/text2sql/metrics.py
@@ -1,21 +1,31 @@
 import time
-from typing import Dict, List
+from dataclasses import dataclass
+from typing import Any, Dict, List
 
 import pandas as pd
 from dbally_benchmark.text2sql.text2sql_result import Text2SQLResult
 from dbally_benchmark.utils import batch
 from sqlalchemy import Engine, text
 
-from dbally.data_models.execution_result import ExecutionResult
+
+@dataclass
+class _ExecutionResult:
+    """
+    Represents the result of a single query execution
+    """
+
+    results: List[Dict[str, Any]]
+    context: Dict[str, Any]
+    execution_time: float
 
 
-def _run_query(query: str, engine: Engine) -> ExecutionResult:
+def _run_query(query: str, engine: Engine) -> _ExecutionResult:
     with engine.connect() as connection:
         start_time = time.time()
         rows = connection.execute(text(query)).fetchall()
         execution_time = time.time() - start_time
 
-    return ExecutionResult(
+    return _ExecutionResult(
         results=[dict(row._mapping) for row in rows],  # pylint: disable=protected-access
         execution_time=execution_time,
         context={"sql": query},
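
Review note: the hunk above stops importing `ExecutionResult` from `dbally` and defines a benchmark-local `_ExecutionResult` dataclass instead. The same timing pattern can be sketched standalone as below. This is a minimal sketch, not the benchmark code: it assumes the stdlib `sqlite3` module in place of the SQLAlchemy `Engine`, uses `time.perf_counter()` where the benchmark uses `time.time()`, and `TimedResult` and `run_query` are hypothetical names.

```python
import sqlite3
import time
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TimedResult:
    """Hypothetical stand-in for the benchmark's _ExecutionResult."""

    results: List[Dict[str, Any]]
    context: Dict[str, Any]
    execution_time: float


def run_query(query: str, connection: sqlite3.Connection) -> TimedResult:
    # Time only the execute-and-fetch step, mirroring the benchmark helper.
    start_time = time.perf_counter()
    cursor = connection.execute(query)
    rows = cursor.fetchall()
    execution_time = time.perf_counter() - start_time

    # Column names come from the cursor description (first item of each entry).
    columns = [column[0] for column in cursor.description]
    return TimedResult(
        results=[dict(zip(columns, row)) for row in rows],
        context={"sql": query},
        execution_time=execution_time,
    )


# Usage with an in-memory database (illustrative data only).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE t (id INTEGER, name TEXT)")
connection.execute("INSERT INTO t VALUES (1, 'a'), (2, 'b')")
result = run_query("SELECT id, name FROM t ORDER BY id", connection)
print(result.results)
```

Keeping the dataclass private to the benchmark means the benchmark no longer depends on the shape of the library's result type, which this commit changes.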

diff --git a/docs/concepts/collections.md b/docs/concepts/collections.md
@@ -25,4 +25,7 @@ my_collection.ask("Find me Italian recipes for soups")
 
 In this scenario, the LLM first determines the most suitable view to address the query, and then that view is used to pull the relevant data.
 
+!!! info
+    The result of a query is an [`ExecutionResult`][dbally.data_models.execution_result.ExecutionResult] object containing the data fetched by the view. Its `results` attribute holds the actual data as a list of dictionaries. The structure of these dictionaries depends on the view that fetched the data; the `view_name` attribute of the `ExecutionResult` object tells you which view that was.
+
 It's possible for projects to feature several collections, each potentially housing a different set of views. Moreover, a single view can be associated with multiple collections, offering versatile usage across various contexts.
diff --git a/docs/how-to/pandas_views.md b/docs/how-to/pandas_views.md
@@ -1,4 +1,4 @@
-# How To: Use Pandas DataFrames with db-ally
+# How-To: Use Pandas DataFrames with db-ally
 
 In this guide, you will learn how to write [views](../concepts/views.md) that use [Pandas](https://pandas.pydata.org/) DataFrames as their data source. You will understand how to define such a view, create filters that operate on the DataFrame, and register it while providing it with the source DataFrame.
 

diff --git a/docs/quickstart/quickstart2.md b/docs/quickstart/quickstart2.md
@@ -148,4 +148,5 @@ That's it! You can apply similar techniques to any other filter that takes a str
 To see the full example, you can find the code here: [quickstart2_code.py](quickstart2_code.py).
 
 ## Next Steps
-See the [Tutorial](../tutorials.md) for a more in-depth guide on how to use db-ally.
+
+Explore [Quickstart Part 3: Multiple Views](./quickstart3.md) to learn how to run queries with multiple views and display the results based on the view that was used to fetch the data.
diff --git a/docs/quickstart/quickstart3.md b/docs/quickstart/quickstart3.md
@@ -0,0 +1,130 @@
+# Quickstart Guide 3: Multiple Views
+
+This guide continues from [Quickstart Guide 2](./quickstart2.md). It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 2 code here: [quickstart2_code.py](quickstart2_code.py).
+
+The guide illustrates how to use multiple views to handle queries requiring different types of data. `CandidateView` and `JobView` are used as examples.
+
+## How does having multiple views work?
+
+You can register multiple views with a collection. When you run a query, the AI model decides which view to use based on the query. This allows for handling different kinds of queries with different views. Those views can be based on data from the same source (e.g., different tables in the same database) or from different sources (e.g., a database and an API).
+
+Upon selecting the view, the AI model uses it to extract the relevant data from its data source. The query result is an [`ExecutionResult`][dbally.data_models.execution_result.ExecutionResult] object. It contains the data extracted by the view, along with other metadata including the name of the view that fetched the data.
+
+## Defining the JobView
+
+Our collection already has a registered `CandidateView` that allows us to extract candidates from the database. Let's now define a `JobView` that can fetch job offers, using a different data source - a Pandas DataFrame.
+
+!!! info
+    For further information on utilizing Pandas DataFrames as a data source, refer to the [How-To: Use Pandas DataFrames with db-ally](../how-to/pandas_views.md) guide.
+
+First, let's define the dataframe that will serve as our data source:
+
+```python
+import pandas as pd
+
+jobs_data = pd.DataFrame.from_records([
+    {"title": "Data Scientist", "company": "Company A", "location": "New York", "salary": 100000},
+    {"title": "Data Engineer", "company": "Company B", "location": "San Francisco", "salary": 120000},
+    {"title": "Machine Learning Engineer", "company": "Company C", "location": "Berlin", "salary": 90000},
+    {"title": "Data Scientist", "company": "Company D", "location": "London", "salary": 110000},
+    {"title": "Data Scientist", "company": "Company E", "location": "Warsaw", "salary": 80000},
+])
+```
+
+The dataframe holds job offer information, including the job title, company, location, and salary. Let's now define the `JobView` class:
+
+```python
+from dbally import decorators, DataFrameBaseView
+
+class JobView(DataFrameBaseView):
+    """
+    View for retrieving information about job offers.
+    """
+
+    @decorators.view_filter()
+    def with_salary_at_least(self, salary: int) -> pd.Series:
+        """
+        Filters job offers with a salary of at least `salary`.
+        """
+        return self.df.salary >= salary
+
+    @decorators.view_filter()
+    def in_location(self, location: str) -> pd.Series:
+        """
+        Filters job offers in a specific location.
+        """
+        return self.df.location == location
+
+    @decorators.view_filter()
+    def from_company(self, company: str) -> pd.Series:
+        """
+        Filters job offers from a specific company.
+        """
+        return self.df.company == company
+```
+
+The `JobView` class inherits from `DataFrameBaseView`, a base class for views utilizing Pandas DataFrames as a data source. The class defines three filter methods: `with_salary_at_least`, `in_location`, and `from_company`. These methods filter the job offers based on salary, location, and company, respectively.
+
+!!! note
+    The description of the view class is crucial for the AI model to understand the view's purpose. It helps the model decide which view to use for a specific query.
+
+Now, let's register the `JobView` with the collection by adding this line to `main()`:
+
+```python
+collection.add(JobView, lambda: JobView(jobs_data))
+```
+
+## Running queries with multiple views
+
+Now that we have both `CandidateView` and `JobView` registered with the collection, we can run queries involving both data types. First, let's define a function that can display query results:
+
+```python
+from dbally import ExecutionResult
+
+def display_results(result: ExecutionResult):
+    if result.view_name == "CandidateView":
+        print(f"{len(result.results)} Candidates:")
+        for candidate in result.results:
+            print(f"{candidate['name']} - {candidate['skills']}")
+    elif result.view_name == "JobView":
+        print(f"{len(result.results)} Job Offers:")
+        for job in result.results:
+            print(f"{job['title']} at {job['company']} in {job['location']}")
+```
+
+The `display_results` function receives an `ExecutionResult` object as an argument and prints results based on the view that fetched the data. It shows how you can handle different types of data in the query results.
+
+Now, let's try running a query about job offers:
+
+```python
+result = await collection.ask("Find me job offers in New York with a salary of at least 100000.")
+display_results(result)
+```
+
+Based on our data, this should return the following output, provided by the `JobView`:
+
+```
+1 Job Offers:
+Data Scientist at Company A in New York
+```
+
+Now, let's run a candidates query on the same collection:
+
+```python
+result = await collection.ask("Find me candidates from Poland.")
+display_results(result)
+```
+
+This query should yield the following output, provided by the `CandidateView`:
+
+```
+3 Candidates:
+Yuri Kowalski - SQL;Database Management;Data Modeling
+Julia Nowak - Adobe XD;Sketch;Figma
+Anna Kowalska - AWS;Azure;Google Cloud
+```
+
+That wraps it up! You can find the full example code here: [quickstart3_code.py](quickstart3_code.py).
+
+## Next Steps
+Visit the [Tutorial](../tutorials.md) for a more comprehensive guide on how to use db-ally.
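
Review note: the `JobView` filters in the guide each return a boolean `pd.Series`. The sketch below shows, in plain pandas, what such masks look like and how two of them narrow the sample data for the job-offer query. It assumes the masks are combined with a logical AND for this particular query; how `DataFrameBaseView` combines filters internally is not shown in this diff.

```python
import pandas as pd

# Same sample data as in the guide.
jobs_data = pd.DataFrame.from_records([
    {"title": "Data Scientist", "company": "Company A", "location": "New York", "salary": 100000},
    {"title": "Data Engineer", "company": "Company B", "location": "San Francisco", "salary": 120000},
    {"title": "Machine Learning Engineer", "company": "Company C", "location": "Berlin", "salary": 90000},
    {"title": "Data Scientist", "company": "Company D", "location": "London", "salary": 110000},
    {"title": "Data Scientist", "company": "Company E", "location": "Warsaw", "salary": 80000},
])

# Each comparison yields a boolean Series aligned with the DataFrame's index.
in_new_york = jobs_data.location == "New York"
salary_at_least = jobs_data.salary >= 100000

# Assumption: both conditions must hold, so the masks are AND-ed.
matching = jobs_data[in_new_york & salary_at_least]
print(matching.to_dict(orient="records"))
```

Only the Company A offer satisfies both conditions, which is consistent with the single job offer in the expected output listed in the guide.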
diff --git a/docs/quickstart/quickstart3_code.py b/docs/quickstart/quickstart3_code.py
@@ -0,0 +1,144 @@
+# pylint: disable=missing-return-doc, missing-param-doc, missing-function-docstring
+import dbally
+import os
+import asyncio
+from typing_extensions import Annotated
+
+import sqlalchemy
+from sqlalchemy import create_engine
+from sqlalchemy.ext.automap import automap_base
+import pandas as pd
+
+from dbally import decorators, SqlAlchemyBaseView, DataFrameBaseView, ExecutionResult
+from dbally.audit.event_handlers.cli_event_handler import CLIEventHandler
+from dbally.similarity import SimpleSqlAlchemyFetcher, FaissStore, SimilarityIndex
+from dbally.embedding_client.openai import OpenAiEmbeddingClient
+
+engine = create_engine('sqlite:///candidates.db')
+
+Base = automap_base()
+Base.prepare(autoload_with=engine)
+
+Candidate = Base.classes.candidates
+
+dbally.use_openai_llm(
+    openai_api_key=os.environ["OPENAI_API_KEY"],
+    model_name="gpt-3.5-turbo",
+)
+
+country_similarity = SimilarityIndex(
+    fetcher=SimpleSqlAlchemyFetcher(
+        engine,
+        table=Candidate,
+        column=Candidate.country,
+    ),
+    store=FaissStore(
+        index_dir="./similarity_indexes",
+        index_name="country_similarity",
+        embedding_client=OpenAiEmbeddingClient(
+            api_key=os.environ["OPENAI_API_KEY"],
+        )
+    ),
+)
+
+class CandidateView(SqlAlchemyBaseView):
+    """
+    A view for retrieving candidates from the database.
+    """
+    def get_select(self) -> sqlalchemy.Select:
+        """
+        Creates the initial SqlAlchemy select object, which will be used to build the query.
+        """
+        return sqlalchemy.select(Candidate)
+
+    @decorators.view_filter()
+    def at_least_experience(self, years: int) -> sqlalchemy.ColumnElement:
+        """
+        Filters candidates with at least `years` of experience.
+        """
+        return Candidate.years_of_experience >= years
+
+    @decorators.view_filter()
+    def senior_data_scientist_position(self) -> sqlalchemy.ColumnElement:
+        """
+        Filters candidates that can be considered for a senior data scientist position.
+        """
+        return sqlalchemy.and_(
+            Candidate.position.in_(["Data Scientist", "Machine Learning Engineer", "Data Engineer"]),
+            Candidate.years_of_experience >= 3,
+        )
+
+    @decorators.view_filter()
+    def from_country(self, country: Annotated[str, country_similarity]) -> sqlalchemy.ColumnElement:
+        """
+        Filters candidates from a specific country.
+        """
+        return Candidate.country == country
+
+
+jobs_data = pd.DataFrame.from_records([
+    {"title": "Data Scientist", "company": "Company A", "location": "New York", "salary": 100000},
+    {"title": "Data Engineer", "company": "Company B", "location": "San Francisco", "salary": 120000},
+    {"title": "Machine Learning Engineer", "company": "Company C", "location": "Berlin", "salary": 90000},
+    {"title": "Data Scientist", "company": "Company D", "location": "London", "salary": 110000},
+    {"title": "Data Scientist", "company": "Company E", "location": "Warsaw", "salary": 80000},
+])
+
+
+class JobView(DataFrameBaseView):
+    """
+    View for retrieving information about job offers.
+    """
+
+    @decorators.view_filter()
+    def with_salary_at_least(self, salary: int) -> pd.Series:
+        """
+        Filters job offers with a salary of at least `salary`.
+        """
+        return self.df.salary >= salary
+
+    @decorators.view_filter()
+    def in_location(self, location: str) -> pd.Series:
+        """
+        Filters job offers in a specific location.
+        """
+        return self.df.location == location
+
+    @decorators.view_filter()
+    def from_company(self, company: str) -> pd.Series:
+        """
+        Filters job offers from a specific company.
+        """
+        return self.df.company == company
+
+
+def display_results(result: ExecutionResult):
+    if result.view_name == "CandidateView":
+        print(f"{len(result.results)} Candidates:")
+        for candidate in result.results:
+            print(f"{candidate['name']} - {candidate['skills']}")
+    elif result.view_name == "JobView":
+        print(f"{len(result.results)} Job Offers:")
+        for job in result.results:
+            print(f"{job['title']} at {job['company']} in {job['location']}")
+
+
+async def main():
+    await country_similarity.update()
+
+    collection = dbally.create_collection("recruitment")
+    # dbally.use_event_handler(CLIEventHandler())
+    collection.add(CandidateView, lambda: CandidateView(engine))
+    collection.add(JobView, lambda: JobView(jobs_data))
+
+    result = await collection.ask("Find me job offers in New York with a salary of at least 100000.")
+    display_results(result)
+
+    print()
+
+    result = await collection.ask("Find me candidates from Poland.")
+    display_results(result)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
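
Review note: `display_results` branches on `view_name` with an `if`/`elif` chain and silently prints nothing for an unrecognized view. One possible alternative is a formatter registry keyed by view name, sketched below. `ResultStub`, `FORMATTERS`, and `format_result` are hypothetical names, and `ResultStub` only mimics the two attributes of `ExecutionResult` that the example reads (`results` and `view_name`).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


@dataclass
class ResultStub:
    """Hypothetical stand-in exposing only the attributes used below."""

    results: Rows
    view_name: str


def format_candidates(rows: Rows) -> List[str]:
    return [f"{len(rows)} Candidates:"] + [f"{row['name']} - {row['skills']}" for row in rows]


def format_jobs(rows: Rows) -> List[str]:
    return [f"{len(rows)} Job Offers:"] + [
        f"{row['title']} at {row['company']} in {row['location']}" for row in rows
    ]


# One formatter per registered view; adding a view means adding one entry here.
FORMATTERS: Dict[str, Callable[[Rows], List[str]]] = {
    "CandidateView": format_candidates,
    "JobView": format_jobs,
}


def format_result(result: ResultStub) -> List[str]:
    formatter = FORMATTERS.get(result.view_name)
    if formatter is None:
        # Fail loudly instead of printing nothing for an unknown view.
        raise ValueError(f"No formatter registered for view {result.view_name!r}")
    return formatter(result.results)


job_result = ResultStub(
    results=[{"title": "Data Scientist", "company": "Company A", "location": "New York"}],
    view_name="JobView",
)
lines = format_result(job_result)
print("\n".join(lines))
```

Returning lines instead of printing directly also makes the formatting easy to assert on in tests.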
diff --git a/docs/quickstart/similarity_indexes/country_similarity.index b/docs/quickstart/similarity_indexes/country_similarity.index
diff --git a/docs/quickstart/similarity_indexes/country_similarity.npy b/docs/quickstart/similarity_indexes/country_similarity.npy
diff --git a/docs/reference/collection.md b/docs/reference/collection.md
@@ -9,3 +9,5 @@
         - Collection
         - add
         - ask
+
+::: dbally.data_models.execution_result.ExecutionResult
diff --git a/docs/reference/views/index.md b/docs/reference/views/index.md
@@ -19,4 +19,6 @@
 ::: dbally.views.methods_base.MethodsBaseView
     options:
         members:
-        - list_filters
+        - list_filters
+
+::: dbally.data_models.execution_result.ViewExecutionResult
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -5,6 +5,7 @@ nav:
   - Quickstart:
       - quickstart/index.md
       - quickstart/quickstart2.md
+      - quickstart/quickstart3.md
   - Tutorials:
       - tutorials.md
   - Concepts:

diff --git a/src/dbally/__init__.py b/src/dbally/__init__.py
@@ -1,5 +1,6 @@
 """ dbally """
 
+from dbally.data_models.execution_result import ExecutionResult
 from dbally.views import decorators
 from dbally.views.base import AbstractBaseView
 from dbally.views.methods_base import MethodsBaseView
@@ -21,4 +22,5 @@
     "Collection",
     "AbstractBaseView",
     "DataFrameBaseView",
+    "ExecutionResult",
 ]