Separate Spark-compatibility expressions into a library #630
Comments
Thank you for filing this @Blizzara. I have been meaning to suggest this as well. I think it would be very valuable to provide a crate with Spark-compatible expressions but without any JVM/Spark dependencies.
I created #637 so that we can start splitting code out into separate crates. I also started looking at what would be involved in creating a separate crate.
Nice, glad to hear this resonates and thanks!
Hmm, would it make sense to use just DataFusionError for those? I think for a third-party user (like me) it would be fine, and easier if I only need to handle one type of error coming from DataFusion plus these expressions, but I don't know if Comet internally needs something else from the errors.
Thanks for raising this issue, and for Andy's quick response on it. I was thinking about adding a similar crate to DataFusion as well, which could be expanded to support other query engines, such as a Presto-compatible crate. Of course it makes a lot of sense to add such a crate in Comet first, where it can be iterated on quickly.
Yes, I ended up using DataFusionError but wrapping a Spark-specific error. For example:

```rust
Err(DataFusionError::External(Box::new(
    SparkError::ArithmeticOverflow(self.data_type_name.clone()),
)))
```
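To illustrate the pattern being discussed, here is a minimal stdlib-only sketch of wrapping an engine-specific error inside a generic "external" error. `DataFusionError::External` holds a `Box<dyn Error + Send + Sync>`, so a plain boxed error stands in for it here; the `SparkError` variant and message format are assumptions for the sketch, not Comet's actual definitions:

```rust
use std::error::Error;
use std::fmt;

// Hypothetical Spark-style error type (Comet's real SparkError differs).
#[derive(Debug)]
enum SparkError {
    ArithmeticOverflow(String),
}

impl fmt::Display for SparkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SparkError::ArithmeticOverflow(ty) => write!(f, "{ty} overflow"),
        }
    }
}

impl Error for SparkError {}

// Stand-in for returning DataFusionError::External(Box::new(...)):
// the caller sees one error type, the Spark detail stays recoverable.
fn checked_add(a: i32, b: i32) -> Result<i32, Box<dyn Error + Send + Sync>> {
    a.checked_add(b).ok_or_else(|| {
        Box::new(SparkError::ArithmeticOverflow("INT".to_string()))
            as Box<dyn Error + Send + Sync>
    })
}

fn main() {
    assert_eq!(checked_add(1, 2).unwrap(), 3);
    let err = checked_add(i32::MAX, 1).unwrap_err();
    // Downstream code handles the single wrapper type, and downcasts
    // only when it needs the Spark-specific detail.
    assert!(err.downcast_ref::<SparkError>().is_some());
}
```

This is why a single `DataFusionError` surface can still carry Spark-specific information: a downstream consumer matches on one type and downcasts only when needed.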
I think we can close this issue because the new crate has been created. I filed #659 for moving the remainder of the scalar functions over.
What is the problem the feature request solves?
Hey, we're looking to use DataFusion to replace some Spark workflows, but in a somewhat different way than Comet: we convert Spark logical plans into Substrait and then into DataFusion plans. Our goal is still similar to Comet's: to get equivalent-but-faster execution compared to Spark.
We've noticed some differences between Spark's and DataFusion's native expressions, which Comet has already addressed by adding custom expressions that match Spark's behavior more closely, so we'd like to reuse those expressions from Comet where possible.
While that works in theory today, in practice it's difficult given dependency conflicts between Comet's pinned version of DataFusion and the version we use (currently a later commit on DataFusion main), and pulling in all of Comet just for the expressions is unnecessary. So my question is: would you be open to separating the expressions into e.g. a `datafusion-comet-exprs` crate, which could have a reduced set of dependencies and so be easier to reuse downstream? I'd be happy to write a PR if you'd find that acceptable. I might also need to change some functions to be exported from that crate in order to reuse them; Comet operates mainly at the physical-expr level, but for our use it's easier to have the expressions as ScalarUDFs, if that's okay.
I've already tested through a fork that reusing the Comet expressions is possible, and it does give better compatibility with Spark :)
Describe the potential solution
- Rename `core` into `datafusion-comet/core`
- Move `core/src/execution/datafusion/expressions` into `datafusion-comet/expr`
- (probably some more changes to fix cross-dependencies, maybe introduce `datafusion-comet/common` to share code if needed)

Open for other suggestions as well!
Additional context
Related: apache/datafusion#11201