Skip to content

Plan AsyncFuncExpr by physical planner #6

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
goldmedal opened this issue Feb 19, 2025 · 3 comments
Closed

Plan AsyncFuncExpr by physical planner #6

goldmedal opened this issue Feb 19, 2025 · 3 comments

Comments

@goldmedal
Copy link
Owner

Description

There are some reasons why I want to implement in the logical plan level

Go through all the expression

When working on #4, I noticed that it's not so convenient to visit all the expressions(To check if it's an AsyncFuncExpr) of the physical plan. In the logical plan phase, we can use something like LogicalPlan::map_expression to go through all expressions of the plan.

By the way, I considered implementing a similar tree node method for the physical plan. However, the ExecutionPlan isn't an ENUM. It's hard to maintain this API if anyone adds a new ExecutionPlan. 🤔

Discouple with the optimization rule

In my opinion, the optimization phase isn't required for SQL execution. A physical plan should be executable even if we don't apply the optimization rule. The physical planner can ensure that a logical AsyncExecute is planned to AsyncFuncExec, then apply the optimization for the batch coalesce if necessary.

On the other hand, the ordering of the optimization is important. If we do the planning thing in the optimization phase, I guess it may break some optimization effects.

Keep the compatibility for the federation scenario

datafusion-federation attempts to unparse the logical plan into SQL and push it down to the external database. If we can't identify async scalar functions in the logical plan phase, we may generate incorrect SQL. Conversely, if async scalar functions can be recognized during logical planning or unparsing, we can push down only valid plans to the data source and apply the async scalar function to the results from the external database.

Consider the following case: (the pg_items is provided by an external Postgres)

SELECT t.name, llm_bool('Is {t.price} reasonable?', t.price) as reasonable FROM pg_items t WHERE create_date > '2024-01-01'

It will be planned to

Projection t.name, reasonable
  AsyncExecute llm_bool('Is {t.price} reasonable?', t.price) as reasonable
    Filter create_date > '2024-01-01'
      TableScan pg_items

If we apply the concept of datafusion-federation, we can get the plan like

Projection t.name, reasonable
  AsyncExecute llm_bool('Is {t.price} reasonable?', t.price) as reasonable
      Scan( SELECT name, price FROM pg_items WHERE create_date > '2024-01-01' ) as t   // pushdown the scan and filter to Postgres

What I may implement

  • Add a logical plan for AsyncExec
  • Plan AsyncExec by the logical planner. (maybe analyzer rule)
  • Plan logical AsyncExec to the physical AsyncExec. (maybe in the physical planner)

If we plan the logical plan mentioned above to the physical plan:

ProjectExec
   AsyncFuncExec
      DataSourceExec // for the postgres source

Then, apply the optimization rule

ProjectExec
   AsyncFuncExec
      ColeaseBatchExec
          DataSourceExec // for the postgres source
@goldmedal goldmedal changed the title Plan AsyncFuncExpr by logical_planner Plan AsyncFuncExpr in the logical level Feb 19, 2025
@goldmedal
Copy link
Owner Author

Hi @alamb, what do you think about this concept? I also wonder if it makes sense to implement it directly in DataFusion. I think it can solved apache/datafusion#6518

@goldmedal goldmedal changed the title Plan AsyncFuncExpr in the logical level Plan AsyncFuncExpr by physical planner Feb 22, 2025
@goldmedal
Copy link
Owner Author

goldmedal commented Feb 22, 2025

I found that presenting the async function exec as a logical plan doesn't make sense. The logical plan should describe what we need to do (executing a scalar function), not how we do it (executing the scalar function asynchronously).

Even if we don't have a logical plan for it, we can still recognize it when planning federation queries.

I close this issue but I'll try to port the async function exec to the datafusion core.

@alamb
Copy link
Contributor

alamb commented Feb 23, 2025

Thanks @goldmedal -- sorry for my late response. I have been on vacation this past week.

I think porting the async functions idea to the datafusion core would likely be useful for many others as well

Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants