Freeform SQL execution with similarity indexes #24

ludwiktrammer · 2024-05-08T14:30:43Z

No description provided.

github-actions · 2024-05-08T14:31:32Z

Trivy scanning results.

github-actions · 2024-05-08T15:18:53Z

Code Coverage Summary

Filename                                                         Stmts    Miss  Cover    Missing
-------------------------------------------------------------  -------  ------  -------  -------------------------------------------------
dbally/_main.py                                                     12       0  100.00%
dbally/collection.py                                                92       2  97.83%   216, 244
dbally/assistants/base.py                                           23       0  100.00%
dbally/assistants/openai.py                                         59       2  96.61%   59-76
dbally/audit/event_span.py                                           8       2  75.00%   12, 22
dbally/audit/event_tracker.py                                       34      11  67.65%   35, 48, 59, 69, 83-92
dbally/audit/event_handlers/base.py                                 15       0  100.00%
dbally/audit/event_handlers/cli_event_handler.py                    42      27  35.71%   9-12, 38-39, 42-46, 56-58, 70-78, 90-93, 104-107
dbally/audit/event_handlers/langsmith_event_handler.py              25      21  16.00%   6-92
dbally/data_models/audit.py                                         22       0  100.00%
dbally/data_models/execution_result.py                              14       0  100.00%
dbally/data_models/llm_options.py                                   13       0  100.00%
dbally/data_models/prompts/common_validation_utils.py               15       0  100.00%
dbally/data_models/prompts/iql_prompt_template.py                   13       1  92.31%   39
dbally/data_models/prompts/nl_responder_prompt_template.py           8       0  100.00%
dbally/data_models/prompts/prompt_template.py                       28       2  92.86%   27, 35
dbally/data_models/prompts/query_explainer_prompt_template.py        8       0  100.00%
dbally/data_models/prompts/view_selector_prompt_template.py         12       2  83.33%   33-34
dbally/embedding_client/base.py                                      5       0  100.00%
dbally/embedding_client/openai.py                                   19      14  26.32%   20-30, 42-52
dbally/iql/_exceptions.py                                           24       1  95.83%   33
dbally/iql/_processor.py                                            78       7  91.03%   15, 43, 60, 66, 72, 81, 87
dbally/iql/_query.py                                                12       1  91.67%   7
dbally/iql/_type_validators.py                                      39       2  94.87%   24, 28
dbally/iql/syntax.py                                                36       9  75.00%   6-9, 27, 36, 60, 63-66
dbally/iql_generator/iql_generator.py                               32       0  100.00%
dbally/llm_client/base.py                                           21      10  52.38%   24-25, 50-70
dbally/llm_client/openai_client.py                                  21      21  0.00%    1-62
dbally/nl_responder/nl_responder.py                                 31       5  83.87%   75, 82-90
dbally/nl_responder/token_counters.py                               23      10  56.52%   24-25, 53-63
dbally/prompts/prompt_builder.py                                    20       3  85.00%   6, 26-27
dbally/similarity/chroma_store.py                                   35       0  100.00%
dbally/similarity/detector.py                                       73       6  91.78%   21, 200-204
dbally/similarity/faiss_store.py                                    36      33  8.33%    5-94
dbally/similarity/fetcher.py                                         5       0  100.00%
dbally/similarity/index.py                                          18       0  100.00%
dbally/similarity/sqlalchemy_base.py                                40      21  47.50%   17, 35-37, 48-50, 59, 68-71, 81-87, 105-108
dbally/similarity/store.py                                           7       0  100.00%
dbally/utils/errors.py                                               2       0  100.00%
dbally/view_selection/base.py                                        6       0  100.00%
dbally/view_selection/llm_view_selector.py                          20       0  100.00%
dbally/view_selection/random_view_selector.py                        9       9  0.00%    1-27
dbally/views/base.py                                                 7       0  100.00%
dbally/views/decorators.py                                           6       0  100.00%
dbally/views/exposed_functions.py                                   33       1  96.97%   24
dbally/views/methods_base.py                                        34       2  94.12%   75, 83
dbally/views/pandas_base.py                                         33       1  96.97%   64
dbally/views/sqlalchemy_base.py                                     37       7  81.08%   48, 63-65, 83-87
dbally/views/structured.py                                          34       1  97.06%   30
dbally/views/freeform/text2sql/_autodiscovery.py                   113      18  84.07%   107-108, 272, 275, 288-300, 304, 319-321, 333-334
dbally/views/freeform/text2sql/_config.py                           20       0  100.00%
dbally/views/freeform/text2sql/_errors.py                            6       3  50.00%   10-12
dbally/views/freeform/text2sql/_view.py                             76      13  82.89%   67, 70, 73, 76, 79, 91, 136, 140-143, 146, 172
dbally_cli/main.py                                                   5       5  0.00%    1-11
dbally_cli/similarity.py                                            21       0  100.00%
TOTAL                                                             1480     273  81.55%

Diff against main

Filename                                   Stmts    Miss  Cover
---------------------------------------  -------  ------  -------
dbally/views/freeform/text2sql/_view.py      +28      +7  -4.61%
TOTAL                                        +28      +7  -0.13%

Results for commit: e25280c

Minimum allowed coverage is 60%

♻️ This comment has been updated with latest results

mhordynski · 2024-05-09T06:55:56Z

src/dbally/views/freeform/text2sql/_view.py

@@ -91,22 +154,28 @@ async def ask(

    async def _generate_sql(


@ludwiktrammer wdyt about introducing strategy pattern here?

Initially we would have 2 strategies:

in case there is no similarity indexes defined we use prompt without "parameters" logic (to reduce token usage)

with indexes defined we use parameter binding

Maybe, although to be honest I like the decoupling of user-provided values and the SQL code as a general idea, even without similarity indexes. It seems to me that it makes SQL injections less likely (although still possible) and gives us more visibility.

ds-sebastianchwilczynski · 2024-05-10T07:50:54Z

src/dbally/views/freeform/text2sql/_view.py

+        if "name" not in data or not isinstance(data["name"], str):
+            raise ValueError("Parameter name should be a string")
+
+        if "value" not in data or not isinstance(data["value"], str):


Why value must be a string? JSON supports integers and for some questions GPT returned it which was causing Exceptions. Is it expected?

Good catch. I was thinking about the fact that only string values can be passed to similarity indexes, but the LLM is not wrong to use placeholders for values of other types here. I will fix this.

ds-sebastianchwilczynski · 2024-05-10T07:51:42Z

src/dbally/views/freeform/text2sql/_view.py

@@ -91,22 +156,28 @@ async def ask(

    async def _generate_sql(
        self, query: str, conversation: PromptTemplate, llm_client: LLMClient, event_tracker: EventTracker
-    ) -> Tuple[str, PromptTemplate]:
+    ) -> Tuple[str, List[SQLParameterOption], PromptTemplate]:
        response = await llm_client.text_generation(
            template=conversation,
            fmt={"tables": self._get_tables_context(), "dialect": self._engine.dialect.name, "question": query},
            event_tracker=event_tracker,
        )

        conversation = conversation.add_assistant_message(response)


Somehow AzureOpen AI still returns this json {} format.

You mean the ```json marking?

ds-sebastianchwilczynski · 2024-05-10T07:52:28Z

src/dbally/views/freeform/text2sql/_view.py

@@ -91,22 +156,28 @@ async def ask(

    async def _generate_sql(
        self, query: str, conversation: PromptTemplate, llm_client: LLMClient, event_tracker: EventTracker
-    ) -> Tuple[str, PromptTemplate]:
+    ) -> Tuple[str, List[SQLParameterOption], PromptTemplate]:
        response = await llm_client.text_generation(


All { and } characters should be escaped. If they are not prompt_builder fails in the case of retried query as this JSON response is appended to the conversation

How does the prompt builder fail? Generally, I believe that strings should be escaped when they are put into medium that needs the escaping, before that it should live in its canonical non-escaped form. So most likely it's prompt builder itself that should do the escaping - although its hard to tell, since I haven't seen the exact error message.

The exact message is just what I pasted below. It fails on the string.format() as format function expects placeholder to be between {} It interprets JSON as a big placeholder.

I think this may be tricky to implement it inside prompt builder as it is not awere of what is placeholder and what is JSON

Is the problem happening in db-ally's code or your code? Generally speaking prompt consists of two parts: prompt template (which is static) and dynamic values that are injected into placeholders defined in the template. The json is a dynamic value. The root of the problem seems to be that it is added as part of the template itself, and not as a value injected into the template?

In the db-ally code. I just copied it to the notebook to showcase the problem clearly. It occurs because in _generate_sql method in 166 line we do add_assistant_message and it is just string with JSON, this causes this exception.

ds-sebastianchwilczynski · 2024-05-10T11:44:55Z

src/dbally/views/freeform/text2sql/_view.py

@@ -65,15 +130,15 @@ async def ask(
            # We want to catch all exceptions to retry the process.
            # pylint: disable=broad-except
            try:
-                sql, conversation = await self._generate_sql(query, conversation, llm_client, event_tracker)
+                sql, parameters, conversation = await self._generate_sql(query, conversation, llm_client, event_tracker)


As an output we have parameterized sql query without true values, is this behaviour expected? Should the user take care of it by themselves?

What do you mean by "true values"? The user-provided values should be present under "value" for each parameter object.

Right so at the pointed place the sql variable will look like

SELECT * FROM table WHERE X = :placeholder

From this point we do not modify sql variable to fill these :placeholder. If someone runs it in dry_run mode we do not even provide parameters. So if someone want to execute this query later, how can they do this?

Oh, I get it now. Here, as an output of _generate_sql we definitely want the parameterized SQL but the ViewExecutionResult returned by ask should probably include the parameters, in one way or another.

Freeform SQL execution with similarity indexes

6ebd9f2

ludwiktrammer force-pushed the lt/freeform_similarity_execution branch from fe3ed7c to 6ebd9f2 Compare May 8, 2024 14:35

Fix tests

18a612c

mhordynski reviewed May 9, 2024

View reviewed changes

Force json output

e25280c

ds-sebastianchwilczynski reviewed May 13, 2024

View reviewed changes

mhordynski closed this May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Freeform SQL execution with similarity indexes #24

Freeform SQL execution with similarity indexes #24

ludwiktrammer commented May 8, 2024

github-actions bot commented May 8, 2024

github-actions bot commented May 8, 2024 •

edited

Loading

mhordynski May 9, 2024

ludwiktrammer May 9, 2024

ds-sebastianchwilczynski May 10, 2024

ludwiktrammer May 13, 2024

ds-sebastianchwilczynski May 10, 2024

ludwiktrammer May 13, 2024

ds-sebastianchwilczynski May 13, 2024

ds-sebastianchwilczynski May 10, 2024

ludwiktrammer May 13, 2024

ds-sebastianchwilczynski May 13, 2024 •

edited

Loading

ds-sebastianchwilczynski May 13, 2024

ds-sebastianchwilczynski May 13, 2024

ludwiktrammer May 13, 2024 •

edited

Loading

ds-sebastianchwilczynski May 14, 2024

ds-sebastianchwilczynski May 10, 2024 •

edited

Loading

ludwiktrammer May 13, 2024

ds-sebastianchwilczynski May 13, 2024 •

edited

Loading

ludwiktrammer May 14, 2024

Freeform SQL execution with similarity indexes #24

Freeform SQL execution with similarity indexes #24

Conversation

ludwiktrammer commented May 8, 2024

github-actions bot commented May 8, 2024

github-actions bot commented May 8, 2024 • edited Loading

Code Coverage Summary

Diff against main

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ds-sebastianchwilczynski May 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ludwiktrammer May 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ds-sebastianchwilczynski May 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ds-sebastianchwilczynski May 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented May 8, 2024 •

edited

Loading

ds-sebastianchwilczynski May 13, 2024 •

edited

Loading

ludwiktrammer May 13, 2024 •

edited

Loading

ds-sebastianchwilczynski May 10, 2024 •

edited

Loading

ds-sebastianchwilczynski May 13, 2024 •

edited

Loading