Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Freeform SQL execution with similarity indexes #24

Closed
wants to merge 3 commits into from

Conversation

ludwiktrammer
Copy link
Contributor

No description provided.

Copy link

github-actions bot commented May 8, 2024

Trivy scanning results.

@ludwiktrammer ludwiktrammer force-pushed the lt/freeform_similarity_execution branch from fe3ed7c to 6ebd9f2 Compare May 8, 2024 14:35
Copy link

github-actions bot commented May 8, 2024

badge

Code Coverage Summary

Filename                                                         Stmts    Miss  Cover    Missing
-------------------------------------------------------------  -------  ------  -------  -------------------------------------------------
dbally/_main.py                                                     12       0  100.00%
dbally/collection.py                                                92       2  97.83%   216, 244
dbally/assistants/base.py                                           23       0  100.00%
dbally/assistants/openai.py                                         59       2  96.61%   59-76
dbally/audit/event_span.py                                           8       2  75.00%   12, 22
dbally/audit/event_tracker.py                                       34      11  67.65%   35, 48, 59, 69, 83-92
dbally/audit/event_handlers/base.py                                 15       0  100.00%
dbally/audit/event_handlers/cli_event_handler.py                    42      27  35.71%   9-12, 38-39, 42-46, 56-58, 70-78, 90-93, 104-107
dbally/audit/event_handlers/langsmith_event_handler.py              25      21  16.00%   6-92
dbally/data_models/audit.py                                         22       0  100.00%
dbally/data_models/execution_result.py                              14       0  100.00%
dbally/data_models/llm_options.py                                   13       0  100.00%
dbally/data_models/prompts/common_validation_utils.py               15       0  100.00%
dbally/data_models/prompts/iql_prompt_template.py                   13       1  92.31%   39
dbally/data_models/prompts/nl_responder_prompt_template.py           8       0  100.00%
dbally/data_models/prompts/prompt_template.py                       28       2  92.86%   27, 35
dbally/data_models/prompts/query_explainer_prompt_template.py        8       0  100.00%
dbally/data_models/prompts/view_selector_prompt_template.py         12       2  83.33%   33-34
dbally/embedding_client/base.py                                      5       0  100.00%
dbally/embedding_client/openai.py                                   19      14  26.32%   20-30, 42-52
dbally/iql/_exceptions.py                                           24       1  95.83%   33
dbally/iql/_processor.py                                            78       7  91.03%   15, 43, 60, 66, 72, 81, 87
dbally/iql/_query.py                                                12       1  91.67%   7
dbally/iql/_type_validators.py                                      39       2  94.87%   24, 28
dbally/iql/syntax.py                                                36       9  75.00%   6-9, 27, 36, 60, 63-66
dbally/iql_generator/iql_generator.py                               32       0  100.00%
dbally/llm_client/base.py                                           21      10  52.38%   24-25, 50-70
dbally/llm_client/openai_client.py                                  21      21  0.00%    1-62
dbally/nl_responder/nl_responder.py                                 31       5  83.87%   75, 82-90
dbally/nl_responder/token_counters.py                               23      10  56.52%   24-25, 53-63
dbally/prompts/prompt_builder.py                                    20       3  85.00%   6, 26-27
dbally/similarity/chroma_store.py                                   35       0  100.00%
dbally/similarity/detector.py                                       73       6  91.78%   21, 200-204
dbally/similarity/faiss_store.py                                    36      33  8.33%    5-94
dbally/similarity/fetcher.py                                         5       0  100.00%
dbally/similarity/index.py                                          18       0  100.00%
dbally/similarity/sqlalchemy_base.py                                40      21  47.50%   17, 35-37, 48-50, 59, 68-71, 81-87, 105-108
dbally/similarity/store.py                                           7       0  100.00%
dbally/utils/errors.py                                               2       0  100.00%
dbally/view_selection/base.py                                        6       0  100.00%
dbally/view_selection/llm_view_selector.py                          20       0  100.00%
dbally/view_selection/random_view_selector.py                        9       9  0.00%    1-27
dbally/views/base.py                                                 7       0  100.00%
dbally/views/decorators.py                                           6       0  100.00%
dbally/views/exposed_functions.py                                   33       1  96.97%   24
dbally/views/methods_base.py                                        34       2  94.12%   75, 83
dbally/views/pandas_base.py                                         33       1  96.97%   64
dbally/views/sqlalchemy_base.py                                     37       7  81.08%   48, 63-65, 83-87
dbally/views/structured.py                                          34       1  97.06%   30
dbally/views/freeform/text2sql/_autodiscovery.py                   113      18  84.07%   107-108, 272, 275, 288-300, 304, 319-321, 333-334
dbally/views/freeform/text2sql/_config.py                           20       0  100.00%
dbally/views/freeform/text2sql/_errors.py                            6       3  50.00%   10-12
dbally/views/freeform/text2sql/_view.py                             76      13  82.89%   67, 70, 73, 76, 79, 91, 136, 140-143, 146, 172
dbally_cli/main.py                                                   5       5  0.00%    1-11
dbally_cli/similarity.py                                            21       0  100.00%
TOTAL                                                             1480     273  81.55%

Diff against main

Filename                                   Stmts    Miss  Cover
---------------------------------------  -------  ------  -------
dbally/views/freeform/text2sql/_view.py      +28      +7  -4.61%
TOTAL                                        +28      +7  -0.13%

Results for commit: e25280c

Minimum allowed coverage is 60%

♻️ This comment has been updated with latest results

@@ -91,22 +154,28 @@ async def ask(

async def _generate_sql(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ludwiktrammer wdyt about introducing strategy pattern here?

Initially we would have 2 strategies:

  • in case there is no similarity indexes defined we use prompt without "parameters" logic (to reduce token usage)
  • with indexes defined we use parameter binding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, although to be honest I like the decoupling of user-provided values and the SQL code as a general idea, even without similarity indexes. It seems to me that it makes SQL injections less likely (although still possible) and gives us more visibility.

if "name" not in data or not isinstance(data["name"], str):
raise ValueError("Parameter name should be a string")

if "value" not in data or not isinstance(data["value"], str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why value must be a string? JSON supports integers and for some questions GPT returned it which was causing Exceptions. Is it expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I was thinking about the fact that only string values can be passed to similarity indexes, but the LLM is not wrong to use placeholders for values of other types here. I will fix this.

@@ -91,22 +156,28 @@ async def ask(

async def _generate_sql(
self, query: str, conversation: PromptTemplate, llm_client: LLMClient, event_tracker: EventTracker
) -> Tuple[str, PromptTemplate]:
) -> Tuple[str, List[SQLParameterOption], PromptTemplate]:
response = await llm_client.text_generation(
template=conversation,
fmt={"tables": self._get_tables_context(), "dialect": self._engine.dialect.name, "question": query},
event_tracker=event_tracker,
)

conversation = conversation.add_assistant_message(response)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow AzureOpen AI still returns this json {} format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the ```json marking?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -91,22 +156,28 @@ async def ask(

async def _generate_sql(
self, query: str, conversation: PromptTemplate, llm_client: LLMClient, event_tracker: EventTracker
) -> Tuple[str, PromptTemplate]:
) -> Tuple[str, List[SQLParameterOption], PromptTemplate]:
response = await llm_client.text_generation(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All { and } characters should be escaped. If they are not prompt_builder fails in the case of retried query as this JSON response is appended to the conversation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the prompt builder fail? Generally, I believe that strings should be escaped when they are put into medium that needs the escaping, before that it should live in its canonical non-escaped form. So most likely it's prompt builder itself that should do the escaping - although its hard to tell, since I haven't seen the exact error message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exact message is just what I pasted below. It fails on the string.format() as format function expects placeholder to be between {} It interprets JSON as a big placeholder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may be tricky to implement it inside prompt builder as it is not awere of what is placeholder and what is JSON

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Contributor Author

@ludwiktrammer ludwiktrammer May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the problem happening in db-ally's code or your code? Generally speaking prompt consists of two parts: prompt template (which is static) and dynamic values that are injected into placeholders defined in the template. The json is a dynamic value. The root of the problem seems to be that it is added as part of the template itself, and not as a value injected into the template?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the db-ally code. I just copied it to the notebook to showcase the problem clearly. It occurs because in _generate_sql method in 166 line we do add_assistant_message and it is just string with JSON, this causes this exception.

@@ -65,15 +130,15 @@ async def ask(
# We want to catch all exceptions to retry the process.
# pylint: disable=broad-except
try:
sql, conversation = await self._generate_sql(query, conversation, llm_client, event_tracker)
sql, parameters, conversation = await self._generate_sql(query, conversation, llm_client, event_tracker)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an output we have parameterized sql query without true values, is this behaviour expected? Should the user take care of it by themselves?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "true values"? The user-provided values should be present under "value" for each parameter object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right so at the pointed place the sql variable will look like

SELECT * FROM table WHERE X = :placeholder

From this point we do not modify sql variable to fill these :placeholder. If someone runs it in dry_run mode we do not even provide parameters. So if someone want to execute this query later, how can they do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I get it now. Here, as an output of _generate_sql we definitely want the parameterized SQL but the ViewExecutionResult returned by ask should probably include the parameters, in one way or another.

@mhordynski mhordynski closed this May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants