
Separate Spark-compatibility expressions into a library #630


Closed
Blizzara opened this issue Jul 5, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@Blizzara (Contributor) commented Jul 5, 2024

What is the problem the feature request solves?

Hey, we're looking to use DataFusion to replace some Spark workflows, but in a somewhat different way than Comet - we convert Spark logical plans into Substrait and that into DataFusion. Our goal is still similar to Comet: to get equivalent-but-faster execution compared to Spark.

We've noticed some differences between Spark's and DataFusion's native expressions, which Comet has already addressed by adding custom expressions that match Spark's behavior more closely, so we'd like to reuse those expressions from Comet where possible.

While that works in theory today, in practice it's difficult given dependency conflicts between Comet's pinned DataFusion version and the version we use (currently a later commit on DataFusion main), and pulling in all of Comet just for the expressions is unnecessary. So my question is: would you be open to separating the expressions into e.g. a datafusion-comet-exprs crate, which could have a reduced set of dependencies and so be easier to reuse downstream? I'd be happy to write a PR if you'd find that acceptable.

I might also need to export some functions from that crate in order to reuse them: Comet operates mainly at the physical-expression level, but for our use case it's easier to consume the expressions as ScalarUDFs, if that's okay.

I've already tested through a fork that reusing the Comet expressions is possible, and it does give better compatibility with Spark :)

Describe the potential solution

- Rename core into datafusion-comet/core
- Move core/src/execution/datafusion/expressions into datafusion-comet/expr
- (probably some more changes to fix cross-dependencies, maybe introduce datafusion-comet/common to share stuff if needed)

Open for other suggestions as well!
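To make the proposed split concrete, a workspace manifest along these lines might look like the following. This is a hypothetical sketch: the member names and comments are illustrative, not Comet's actual layout.

```toml
# Hypothetical top-level Cargo.toml for the proposed crate split
# (member names are illustrative, not the actual Comet layout)
[workspace]
members = [
    "core",       # JVM/JNI integration and execution glue
    "spark-expr", # Spark-compatible expressions, minimal dependencies
    "common",     # shared types, if cross-dependencies require it
]
```

The key design point is that the `spark-expr` member would depend only on DataFusion and Arrow, so pure-Rust downstream projects can pull it in without the JVM-facing crates.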

Additional context

Related: apache/datafusion#11201

Blizzara added the enhancement (New feature or request) label on Jul 5, 2024
@andygrove (Member)

Thank you for filing this @Blizzara. I have been meaning to suggest this as well. I think it would be very valuable to provide a crate with Spark-compatible expressions but without any JVM/Spark dependencies.

@andygrove (Member)

I created #637 so that we can start splitting code out into separate crates.

I also started looking at what would be involved in creating a datafusion-comet-spark-expr crate, and the first issue I ran into is that our CometError type has dependencies on the JNI crate, and this doesn't seem ideal for pure Rust downstream projects that just want to re-use the expressions. Perhaps we'll need a separate error type that the new crate can use and that we can convert into a CometError for the Comet/Spark integration.

@Blizzara (Contributor, Author) commented Jul 8, 2024

> I created #637 so that we can start splitting code out into separate crates.

Nice, glad to hear this resonates and thanks!

> I also started looking at what would be involved in creating a datafusion-comet-spark-expr crate, and the first issue I ran into is that our CometError type has dependencies on the JNI crate, and this doesn't seem ideal for pure Rust downstream projects that just want to re-use the expressions. Perhaps we'll need a separate error type that the new crate can use and that we can convert into a CometError for the Comet/Spark integration.

Hmm, would it make sense to just use DataFusionError for those? For a third-party user like me that would be fine, and easier if I only need to handle one error type coming from DataFusion plus these expressions, but I don't know whether Comet internally needs something more from its errors.

@advancedxy (Contributor)

Thanks for raising this issue, and thanks to Andy for the quick response.

I was thinking about adding a similar crate to DataFusion as well, one that could be expanded to support other query engines, such as a Presto-compatible crate. Of course it makes a lot of sense to add such a crate in Comet first, where it can be iterated on quickly.

@andygrove (Member)

> Hmm, would it make sense to just use DataFusionError for those? For a third-party user like me that would be fine, and easier if I only need to handle one error type coming from DataFusion plus these expressions, but I don't know whether Comet internally needs something more from its errors.

Yes, I ended up using DataFusionError but wrapping a Spark-specific error. For example:

```rust
Err(DataFusionError::External(Box::new(
    SparkError::ArithmeticOverflow(self.data_type_name.clone()),
)))
```
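For downstream (non-Comet) consumers, the useful property of this pattern is that the Spark-specific error stays recoverable through the generic wrapper via downcasting. Here is a self-contained sketch in plain Rust without the datafusion dependency: `SparkError` and `EngineError::External` below are minimal stand-ins for Comet's `SparkError` and `DataFusionError::External`, not the actual types.

```rust
use std::error::Error;
use std::fmt;

// Minimal stand-in for Comet's SparkError (illustrative only).
#[derive(Debug)]
enum SparkError {
    ArithmeticOverflow(String),
}

impl fmt::Display for SparkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SparkError::ArithmeticOverflow(ty) => {
                write!(f, "[ARITHMETIC_OVERFLOW] {} overflow", ty)
            }
        }
    }
}

impl Error for SparkError {}

// Stand-in for DataFusionError::External, which wraps any boxed error.
#[derive(Debug)]
enum EngineError {
    External(Box<dyn Error + Send + Sync>),
}

fn checked_add(a: i8, b: i8) -> Result<i8, EngineError> {
    a.checked_add(b).ok_or_else(|| {
        EngineError::External(Box::new(SparkError::ArithmeticOverflow(
            "tinyint".to_string(),
        )))
    })
}

fn main() {
    match checked_add(i8::MAX, 1) {
        Err(EngineError::External(e)) => {
            // A downstream caller can recover the Spark-specific error
            // from the generic wrapper by downcasting.
            if let Some(spark_err) = e.downcast_ref::<SparkError>() {
                println!("spark error: {}", spark_err);
            }
        }
        Ok(v) => println!("ok: {}", v),
    }
}
```

This is why wrapping in `DataFusionError::External` works for both audiences: Comet can downcast and convert into CometError at the JNI boundary, while pure-Rust users only ever handle DataFusionError.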

@andygrove (Member)

I think we can close this issue because the new crate has been created. I filed #659 for moving the remainder of the scalar functions over.
