Separate view-level and collection-level results (#8)
* Separate view-level and collection-level results

* Add tests for Collection.ask

* Add documentation
ludwiktrammer authored Apr 10, 2024
1 parent 74892b9 commit 89062dc
Showing 22 changed files with 471 additions and 44 deletions.
18 changes: 14 additions & 4 deletions benchmark/dbally_benchmark/text2sql/metrics.py
@@ -1,21 +1,31 @@
 import time
-from typing import Dict, List
+from dataclasses import dataclass
+from typing import Any, Dict, List

 import pandas as pd
 from dbally_benchmark.text2sql.text2sql_result import Text2SQLResult
 from dbally_benchmark.utils import batch
 from sqlalchemy import Engine, text

-from dbally.data_models.execution_result import ExecutionResult

+@dataclass
+class _ExecutionResult:
+    """
+    Represents the result of a single query execution
+    """
+
+    results: List[Dict[str, Any]]
+    context: Dict[str, Any]
+    execution_time: float
+

-def _run_query(query: str, engine: Engine) -> ExecutionResult:
+def _run_query(query: str, engine: Engine) -> _ExecutionResult:
     with engine.connect() as connection:
         start_time = time.time()
         rows = connection.execute(text(query)).fetchall()
         execution_time = time.time() - start_time

-    return ExecutionResult(
+    return _ExecutionResult(
         results=[dict(row._mapping) for row in rows],  # pylint: disable=protected-access
         execution_time=execution_time,
         context={"sql": query},
3 changes: 3 additions & 0 deletions docs/concepts/collections.md
@@ -25,4 +25,7 @@ my_collection.ask("Find me Italian recipes for soups")

 In this scenario, the LLM first determines the most suitable view to address the query, and then that view is used to pull the relevant data.

+!!! info
+    The result of a query is an [`ExecutionResult`][dbally.data_models.execution_result.ExecutionResult] object, which contains the data fetched by the view. It contains a `results` attribute that holds the actual data, structured as a list of dictionaries. The exact structure of these dictionaries depends on the view that was used to fetch the data, which can be obtained by looking at the `view_name` attribute of the `ExecutionResult` object.
+
 It's possible for projects to feature several collections, each potentially housing a different set of views. Moreover, a single view can be associated with multiple collections, offering versatile usage across various contexts.
2 changes: 1 addition & 1 deletion docs/how-to/pandas_views.md
@@ -1,4 +1,4 @@
-# How To: Use Pandas DataFrames with db-ally
+# How-To: Use Pandas DataFrames with db-ally

 In this guide, you will learn how to write [views](../concepts/views.md) that use [Pandas](https://pandas.pydata.org/) DataFrames as their data source. You will understand how to define such a view, create filters that operate on the DataFrame, and register it while providing it with the source DataFrame.

3 changes: 2 additions & 1 deletion docs/quickstart/quickstart2.md
@@ -148,4 +148,5 @@ That's it! You can apply similar techniques to any other filter that takes a str
 To see the full example, you can find the code here: [quickstart2_code.py](quickstart2_code.py).

 ## Next Steps
-See the [Tutorial](../tutorials.md) for a more in-depth guide on how to use db-ally.
+
+Explore [Quickstart Part 3: Multiple Views](./quickstart3.md) to learn how to run queries with multiple views and display the results based on the view that was used to fetch the data.
130 changes: 130 additions & 0 deletions docs/quickstart/quickstart3.md
@@ -0,0 +1,130 @@
# Quickstart Guide 3: Multiple Views

This guide continues from [Quickstart Guide 2](./quickstart2.md). It assumes that you have already set up the views and the collection. If not, please refer to the complete Part 2 code here: [quickstart2_code.py](quickstart2_code.py).

The guide illustrates how to use multiple views to handle queries requiring different types of data. `CandidateView` and `JobView` are used as examples.

## How does having multiple views work?

You can register multiple views with a collection. When you run a query, the AI model decides which view to use based on the query. This allows for handling different kinds of queries with different views. Those views can be based on data from the same source (e.g., different tables in the same database), or from different sources (e.g., a database and an API).

Upon selecting the view, the AI model uses it to extract the relevant data from its data source. The query result is an [`ExecutionResult`][dbally.data_models.execution_result.ExecutionResult] object. It contains the data extracted by the view, along with other metadata including the name of the view that fetched the data.
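
As a rough illustration of that shape, the sketch below uses a hypothetical `ResultSketch` dataclass as a stand-in for `ExecutionResult`. It models only the two attributes this guide relies on (`results` and `view_name`); the real class is assumed to carry more metadata, and the `describe` helper is invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ResultSketch:
    """Hypothetical stand-in for ExecutionResult (not the real db-ally class)."""

    results: List[Dict[str, Any]]  # rows fetched by the view, one dict per row
    view_name: str  # name of the view that produced the rows


def describe(result: ResultSketch) -> str:
    """Summarize which view answered and how many rows it returned."""
    return f"{result.view_name} returned {len(result.results)} row(s)"


example = ResultSketch(
    results=[{"title": "Data Scientist", "company": "Company A"}],
    view_name="JobView",
)
print(describe(example))
```

Because the keys of each row depend on the view, code that consumes a result usually needs to check `view_name` before reading specific fields.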

## Defining the JobView

Our collection already has a registered `CandidateView` that allows us to extract candidates from the database. Let's now define a `JobView` that can fetch job offers, using a different data source - a Pandas DataFrame.

!!! info
    For further information on utilizing Pandas DataFrames as a data source, refer to the [How-To: Use Pandas DataFrames with db-ally](../how-to/pandas_views.md) guide.

First, let's define the dataframe that will serve as our data source:

```python
import pandas as pd

jobs_data = pd.DataFrame.from_records([
    {"title": "Data Scientist", "company": "Company A", "location": "New York", "salary": 100000},
    {"title": "Data Engineer", "company": "Company B", "location": "San Francisco", "salary": 120000},
    {"title": "Machine Learning Engineer", "company": "Company C", "location": "Berlin", "salary": 90000},
    {"title": "Data Scientist", "company": "Company D", "location": "London", "salary": 110000},
    {"title": "Data Scientist", "company": "Company E", "location": "Warsaw", "salary": 80000},
])
```

The dataframe holds job offer information, including the job title, company, location, and salary. Let's now define the `JobView` class:

```python
from dbally import decorators, DataFrameBaseView

class JobView(DataFrameBaseView):
    """
    View for retrieving information about job offers.
    """

    @decorators.view_filter()
    def with_salary_at_least(self, salary: int) -> pd.Series:
        """
        Filters job offers with a salary of at least `salary`.
        """
        return self.df.salary >= salary

    @decorators.view_filter()
    def in_location(self, location: str) -> pd.Series:
        """
        Filters job offers in a specific location.
        """
        return self.df.location == location

    @decorators.view_filter()
    def from_company(self, company: str) -> pd.Series:
        """
        Filters job offers from a specific company.
        """
        return self.df.company == company
```

The `JobView` class inherits from `DataFrameBaseView`, a base class for views utilizing Pandas DataFrames as a data source. The class defines three filter methods: `with_salary_at_least`, `in_location`, and `from_company`. These methods filter the job offers based on salary, location, and company, respectively.

!!! note
    The description of the view class is crucial for the AI model to understand the view's purpose. It helps the model decide which view to use for a specific query.

Now, let's register the `JobView` with the collection by adding this line to `main()`:

```python
collection.add(JobView, lambda: JobView(jobs_data))
```

## Running queries with multiple views

Now that we have both `CandidateView` and `JobView` registered with the collection, we can run queries involving both data types. First, let's define a function that can display query results:

```python
from dbally import ExecutionResult

def display_results(result: ExecutionResult):
    if result.view_name == "CandidateView":
        print(f"{len(result.results)} Candidates:")
        for candidate in result.results:
            print(f"{candidate['name']} - {candidate['skills']}")
    elif result.view_name == "JobView":
        print(f"{len(result.results)} Job Offers:")
        for job in result.results:
            print(f"{job['title']} at {job['company']} in {job['location']}")
```

The `display_results` function receives an `ExecutionResult` object as an argument and prints results based on the view that fetched the data. It shows how you can handle different types of data in the query results.
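
If the collection grows beyond two views, the `if`/`elif` chain above can be swapped for a lookup table. The following is a minimal sketch of that alternative, not part of db-ally: `FORMATTERS`, `format_candidate`, `format_job`, and `render` are hypothetical names, and the rows are plain dictionaries shaped like the ones in this guide.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def format_candidate(row: Row) -> str:
    """Format one candidate row as 'name - skills'."""
    return f"{row['name']} - {row['skills']}"


def format_job(row: Row) -> str:
    """Format one job row as 'title at company in location'."""
    return f"{row['title']} at {row['company']} in {row['location']}"


# Maps a view name to a (heading, row formatter) pair.
FORMATTERS: Dict[str, Tuple[str, Callable[[Row], str]]] = {
    "CandidateView": ("Candidates", format_candidate),
    "JobView": ("Job Offers", format_job),
}


def render(view_name: str, rows: List[Row]) -> List[str]:
    """Return printable lines for the rows, or a fallback line for an unknown view."""
    if view_name not in FORMATTERS:
        return [f"No formatter registered for view {view_name!r}"]
    heading, formatter = FORMATTERS[view_name]
    return [f"{len(rows)} {heading}:"] + [formatter(row) for row in rows]


for line in render("JobView", [{"title": "Data Scientist", "company": "Company A", "location": "New York"}]):
    print(line)
```

With a real result you would call `render(result.view_name, result.results)`. Unlike the chain above, this version reports a view it does not recognize instead of printing nothing.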

Now, let's try running a query about job offers:

```python
result = await collection.ask("Find me job offers in New York with a salary of at least 100000.")
display_results(result)
```

Based on our data, this should return the following output, provided by the `JobView`:

```
1 Job Offers:
Data Scientist at Company A in New York
```
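
To see why only one row comes back, the check below repeats the two conditions in plain Python on the same records, without pandas or db-ally. It assumes the model translated the question into the `in_location` and `with_salary_at_least` filters with the arguments `"New York"` and `100000`; the model's actual choice may differ.

```python
# The same five job offers as in the DataFrame above, as plain dictionaries.
jobs = [
    {"title": "Data Scientist", "company": "Company A", "location": "New York", "salary": 100000},
    {"title": "Data Engineer", "company": "Company B", "location": "San Francisco", "salary": 120000},
    {"title": "Machine Learning Engineer", "company": "Company C", "location": "Berlin", "salary": 90000},
    {"title": "Data Scientist", "company": "Company D", "location": "London", "salary": 110000},
    {"title": "Data Scientist", "company": "Company E", "location": "Warsaw", "salary": 80000},
]

# Both conditions must hold, mirroring the two filters applied together.
matches = [
    job for job in jobs
    if job["location"] == "New York" and job["salary"] >= 100000
]

for job in matches:
    print(f"{job['title']} at {job['company']} in {job['location']}")
```

Only the Company A offer satisfies both conditions: its salary of exactly 100000 passes the "at least" comparison, and no other offer is located in New York.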

Now, let's run a candidates query on the same collection:

```python
result = await collection.ask("Find me candidates from Poland")
display_results(result)
```

This query should yield the following output, provided by the `CandidateView`:

```
3 Candidates:
Yuri Kowalski - SQL;Database Management;Data Modeling
Julia Nowak - Adobe XD;Sketch;Figma
Anna Kowalska - AWS;Azure;Google Cloud
```

That wraps it up! You can find the full example code here: [quickstart3_code.py](quickstart3_code.py).

## Next Steps
Visit the [Tutorial](../tutorials.md) for a more comprehensive guide on how to use db-ally.
144 changes: 144 additions & 0 deletions docs/quickstart/quickstart3_code.py
@@ -0,0 +1,144 @@
# pylint: disable=missing-return-doc, missing-param-doc, missing-function-docstring
import dbally
import os
import asyncio
from typing_extensions import Annotated

import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
import pandas as pd

from dbally import decorators, SqlAlchemyBaseView, DataFrameBaseView, ExecutionResult
from dbally.audit.event_handlers.cli_event_handler import CLIEventHandler
from dbally.similarity import SimpleSqlAlchemyFetcher, FaissStore, SimilarityIndex
from dbally.embedding_client.openai import OpenAiEmbeddingClient

engine = create_engine('sqlite:///candidates.db')

Base = automap_base()
Base.prepare(autoload_with=engine)

Candidate = Base.classes.candidates

dbally.use_openai_llm(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model_name="gpt-3.5-turbo",
)

country_similarity = SimilarityIndex(
    fetcher=SimpleSqlAlchemyFetcher(
        engine,
        table=Candidate,
        column=Candidate.country,
    ),
    store=FaissStore(
        index_dir="./similarity_indexes",
        index_name="country_similarity",
        embedding_client=OpenAiEmbeddingClient(
            api_key=os.environ["OPENAI_API_KEY"],
        )
    ),
)

class CandidateView(SqlAlchemyBaseView):
    """
    A view for retrieving candidates from the database.
    """
    def get_select(self) -> sqlalchemy.Select:
        """
        Creates the initial SqlAlchemy select object, which will be used to build the query.
        """
        return sqlalchemy.select(Candidate)

    @decorators.view_filter()
    def at_least_experience(self, years: int) -> sqlalchemy.ColumnElement:
        """
        Filters candidates with at least `years` of experience.
        """
        return Candidate.years_of_experience >= years

    @decorators.view_filter()
    def senior_data_scientist_position(self) -> sqlalchemy.ColumnElement:
        """
        Filters candidates that can be considered for a senior data scientist position.
        """
        return sqlalchemy.and_(
            Candidate.position.in_(["Data Scientist", "Machine Learning Engineer", "Data Engineer"]),
            Candidate.years_of_experience >= 3,
        )

    @decorators.view_filter()
    def from_country(self, country: Annotated[str, country_similarity]) -> sqlalchemy.ColumnElement:
        """
        Filters candidates from a specific country.
        """
        return Candidate.country == country


jobs_data = pd.DataFrame.from_records([
    {"title": "Data Scientist", "company": "Company A", "location": "New York", "salary": 100000},
    {"title": "Data Engineer", "company": "Company B", "location": "San Francisco", "salary": 120000},
    {"title": "Machine Learning Engineer", "company": "Company C", "location": "Berlin", "salary": 90000},
    {"title": "Data Scientist", "company": "Company D", "location": "London", "salary": 110000},
    {"title": "Data Scientist", "company": "Company E", "location": "Warsaw", "salary": 80000},
])


class JobView(DataFrameBaseView):
    """
    View for retrieving information about job offers.
    """

    @decorators.view_filter()
    def with_salary_at_least(self, salary: int) -> pd.Series:
        """
        Filters job offers with a salary of at least `salary`.
        """
        return self.df.salary >= salary

    @decorators.view_filter()
    def in_location(self, location: str) -> pd.Series:
        """
        Filters job offers in a specific location.
        """
        return self.df.location == location

    @decorators.view_filter()
    def from_company(self, company: str) -> pd.Series:
        """
        Filters job offers from a specific company.
        """
        return self.df.company == company


def display_results(result: ExecutionResult):
    if result.view_name == "CandidateView":
        print(f"{len(result.results)} Candidates:")
        for candidate in result.results:
            print(f"{candidate['name']} - {candidate['skills']}")
    elif result.view_name == "JobView":
        print(f"{len(result.results)} Job Offers:")
        for job in result.results:
            print(f"{job['title']} at {job['company']} in {job['location']}")


async def main():
    await country_similarity.update()

    collection = dbally.create_collection("recruitment")
    # dbally.use_event_handler(CLIEventHandler())
    collection.add(CandidateView, lambda: CandidateView(engine))
    collection.add(JobView, lambda: JobView(jobs_data))

    result = await collection.ask("Find me job offers in New York with a salary of at least 100000.")
    display_results(result)

    print()

    result = await collection.ask("Find me candidates from Poland.")
    display_results(result)


if __name__ == "__main__":
    asyncio.run(main())
Binary file not shown.
Binary file not shown.
2 changes: 2 additions & 0 deletions docs/reference/collection.md
@@ -9,3 +9,5 @@
       - Collection
       - add
       - ask
+
+::: dbally.data_models.execution_result.ExecutionResult
4 changes: 3 additions & 1 deletion docs/reference/views/index.md
@@ -19,4 +19,6 @@
 ::: dbally.views.methods_base.MethodsBaseView
     options:
       members:
-      - list_filters
+      - list_filters
+
+::: dbally.data_models.execution_result.ViewExecutionResult
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -5,6 +5,7 @@ nav:
   - Quickstart:
     - quickstart/index.md
     - quickstart/quickstart2.md
+    - quickstart/quickstart3.md
   - Tutorials:
     - tutorials.md
   - Concepts:
2 changes: 2 additions & 0 deletions src/dbally/__init__.py
@@ -1,5 +1,6 @@
 """ dbally """

+from dbally.data_models.execution_result import ExecutionResult
 from dbally.views import decorators
 from dbally.views.base import AbstractBaseView
 from dbally.views.methods_base import MethodsBaseView
@@ -21,4 +22,5 @@
     "Collection",
     "AbstractBaseView",
     "DataFrameBaseView",
+    "ExecutionResult",
 ]