Support for SQLContext #999

Draft · wants to merge 2 commits into main
Conversation

ceyhunkerti
Contributor

@ceyhunkerti commented Oct 2, 2024

  • This is a draft PR seeking further guidance on supporting the SQLContext interface of Polars (and potentially similar APIs from other backends in the future).
  • The goal is to enable SQL usage, which I believe could help address issues like this one.
  • I've also added a sql_context_test.exs file to demonstrate the usage pattern (a rough sketch of that pattern is shown below).
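
For context, a rough sketch of what that usage pattern could look like (the module and function names below are illustrative placeholders, not the actual code in this PR):

# Hypothetical sketch; names are illustrative and may differ from the PR.
alias Explorer.DataFrame, as: DF

df1 = DF.new(id: [1, 2, 3], name: ["a", "b", "c"])
df2 = DF.new(id: [1, 2, 3], score: [10, 20, 30])

ctx =
  Explorer.SQLContext.new()
  |> Explorer.SQLContext.register("t1", df1)
  |> Explorer.SQLContext.register("t2", df2)

result =
  Explorer.SQLContext.execute(ctx, """
  SELECT t1.name, t2.score
  FROM t1
  JOIN t2 ON t1.id = t2.id
  """)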

Request for Feedback

@billylanchantin, @josevalim, @philss, @anyone_maintaining_the_repo :). I would greatly appreciate it if you could review this and provide feedback, pointing me in the right direction for any improvements.

Notes

  • This is still a work in progress. Some parameters, options, and documentation are missing and will be added in future commits.
  • I'm relatively new to both languages but am being diligent in my research and careful with the implementation.

Thank you for your time and guidance!

@billylanchantin
Contributor

Hi @ceyhunkerti!

Quick question before I dig into the code. You say:

The goal is to enable SQL usage, which I believe could help address issues like #969.

Would Explorer.DataFrame.sql/3 be sufficient for this?
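
For reference, DF.sql/3 runs a query against a single data frame, roughly like this (hedged example from memory; the default table name "df" used inside the query is an assumption and may differ):

# Single-frame SQL via Explorer.DataFrame.sql/3 (assumed "df" table name).
alias Explorer.DataFrame, as: DF

frame = DF.new(a: [1, 2, 3], b: [4.0, 5.0, 6.0])
DF.sql(frame, "SELECT a, b * 2 AS b2 FROM df WHERE a > 1")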

@ceyhunkerti
Contributor Author

ceyhunkerti commented Oct 2, 2024

Hey, thanks for the quick feedback.
I saw it, but I believe it's restricted to a single df, right?
SQLContext operates on any number of dfs, as if they were tables.

@billylanchantin
Contributor

Ah I see. A function that only works on one frame certainly won't enable multi-frame joins! Thanks for clarifying the goal.

I'll take a look soon. Note: we were/are hesitant about supporting Polars-specific SQL interfaces. You can read the discussion here:

#818

We went with DF.sql/3 because it was backend-agnostic.

@ceyhunkerti
Contributor Author

hmm, thinking out loud 🤔

Trying to understand backend independence:

  • We say DF.sql/3 is backend independent.
  • When we call it using the polars backend and get a df, I believe it stores a polars data frame internally.
  • So how do we use this df in another backend if we want to mix backends in the same script, like:
# pseudo
df1 = DF.sql("select ...", backend: :polars) |> DF.compute()
df2 = DF.some_action(df1, backend: :my_new_backend)

I believe the above is not possible unless my_new_backend supports interacting with polars dataframes, right?
So in that sense, DF.sql or any other function that returns a df is not actually backend independent when it comes to mixing.

I think by backend independence we really just mean API independence, without mixing backends.

If we are worried about the SQL itself, I partially agree; for example,
select ... from t1, t2 is not supported by polars-sql but works with duckdb.
But I believe it's a reasonable trade-off when the developer knows about the backends they use.
(And imho: any backend claiming to have a SQL interface should at least support ANSI SQL.)

PS. When thinking of another backend, the first one that comes to my mind is duckdb, which can operate on polars dataframes.
So in that sense, the SQLContext in this PR could also be easily implemented in a duckdb backend.

btw, I'm doing this for learning purposes and am happy to help where I can. We can park this one for now if you'd like, and I can focus on something else that you think might be more helpful or a higher priority.

@josevalim
Member

What if we allow DF.sql(df, "select", tables: [foo: df2, bar: df3])?

@josevalim
Member

Or this could be even better:

DF.sql([foo: df1, bar: df2, baz: df3], "select")?

@billylanchantin
Contributor

@josevalim You read my mind! Perhaps even:

DF.sql(%{foo: df1, bar: df2, baz: df3}, "select")

to prevent duplicates?

Also I think behind the scenes we'd need to individually register each table with SQLContext.
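
A rough sketch of what that per-table registration could look like at the Elixir level (the SQLContext module and its new/register/execute functions here are hypothetical placeholders for whatever the backend ends up exposing):

# Hypothetical: register every frame in the map, then run the query.
def sql(dfs, query, opts \\ []) when is_map(dfs) do
  Enum.reduce(dfs, SQLContext.new(), fn {name, df}, ctx ->
    SQLContext.register(ctx, Atom.to_string(name), df)
  end)
  |> SQLContext.execute(query, opts)
end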

@josevalim
Member

If we use a map, we don't have the concept of "first".

@ceyhunkerti
Contributor Author

ceyhunkerti commented Oct 2, 2024

@josevalim You read my mind! Perhaps even:

DF.sql(%{foo: df1, bar: df2, baz: df3}, "select")

to prevent duplicates?

Also I think behind the scenes we'd need to individually register each table with SQLContext.

I also mentioned something similar here: lf_sql(sql: string, registry: [(df, table_name), ...]). A map is better, though, like you suggested.

Back to the original issue: DF.sql vs SQLContext.

SQLContext is a wider and more flexible interface, I think; our current DF.sql also uses it on the polars side. But it's your call, we can do it whichever way you decide. If we go with DF.sql(%{foo: df1, bar: df2, baz: df3}, "select"), I can only comment on the order of the parameters :)

DF.sql("select", %{})

seems more natural to me, like print("Hello %s" % param). :)

@ceyhunkerti
Contributor Author

ceyhunkerti commented Oct 2, 2024

Also, this referenced issue here makes me think.

In the issue, the author mentions that in order to achieve complex and/or logic, one can do the following with the API:

result =
  all_pairs
  |> DataFrame.join(all_relationships_1, how: :left,
      on: [
        {"table_name_1", "table_name"}, 
        {"column_name_1", "column_name"}, 
        {"table_name_2", "referenced_table_name"}, 
        {"column_name_2", "referenced_column_name"}
      ])
  |> DataFrame.join(all_relationships_2, how: :left,
      on: [
        {"table_name_2", "table_name"}, 
        {"column_name_2", "column_name"}, 
        {"table_name_1", "referenced_table_name"}, 
        {"column_name_1", "referenced_column_name"}
      ])
  |> DataFrame.mutate(has_relationship: coalesce(has_relationship_1, has_relationship_2))

I'm not exactly sure what execution plan polars creates for this, but it makes me wonder whether it duplicates the dataframe in memory and does some hash-like join just to express a slightly complex and/or condition, with a probable performance degradation. It may be smarter than I think, but this makes me suspicious.

In the end, this case, IMHO, justifies having a powerful SQL interface in the library one way or the other.
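
For comparison, a hedged sketch of how the same logic might read through the multi-frame DF.sql shape proposed above (the map-based signature is only a proposal at this point; table and column names are taken from the quoted snippet):

# Hypothetical: assumes the map-based DF.sql proposed earlier in this thread.
result =
  DF.sql(
    %{pairs: all_pairs, rels_1: all_relationships_1, rels_2: all_relationships_2},
    """
    SELECT p.*,
           COALESCE(r1.has_relationship, r2.has_relationship) AS has_relationship
    FROM pairs p
    LEFT JOIN rels_1 r1
      ON p.table_name_1 = r1.table_name
     AND p.column_name_1 = r1.column_name
     AND p.table_name_2 = r1.referenced_table_name
     AND p.column_name_2 = r1.referenced_column_name
    LEFT JOIN rels_2 r2
      ON p.table_name_2 = r2.table_name
     AND p.column_name_2 = r2.column_name
     AND p.table_name_1 = r2.referenced_table_name
     AND p.column_name_1 = r2.referenced_column_name
    """
  )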
