Skip to content

Separate Spark-compatibility expressions into a library #630

Closed
@Blizzara

Description

@Blizzara

What is the problem the feature request solves?

Hey, we're looking to use DataFusion to replace some Spark workflows, but in a somewhat different way than Comet - we convert Spark logical plans into Substrait and that into DataFusion. Our goal is still similar to Comet: to get equivalent-but-faster execution compared to Spark.

We've noticed some differences in the Spark and DataFusion native expressions, which you guys have already addressed in Comet by adding custom expressions that match closer to Spark's, and so we'd like to be able to reuse the expressions from Comet where possible.

While that works today in theory, in practice it's a bit difficult given dependency issues between Comet's version of DataFusion and the version we use (just later commit on datafusion main atm, but still), and also pulling all of Comet for just the expressions is a bit unnecessary. So my question is - would you be open to separating the expressions into e.g. datafusion-comet-exprs crate, which could have a reduced set of dependencies and so be easier to reuse downstream? I'd be happy to write a PR if you'd find it acceptable.

I might need to also change some functions to be exported from that crate to reuse them, as Comet operates mainly on the physical expr level but for our use it's easier to have the expressions as ScalarUDFs, if that's okay.

I tested already through a fork that reusing the Comet expressions is possible, and it does give better compatibility with Spark :)

Describe the potential solution

Rename core into datafusion-comet/core
Move core/src/execution/datafusion/expressions into datafusion-comet/expr
(probably some more changes to fix cross-dependencies, maybe introduce datafusion-comet/common to share stuff if needed)
Open for other suggestions as well!

Additional context

Related: apache/datafusion#11201

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions