Hi @thhapke.
I'd be happy to have a call to discuss all of this in more detail. Thanks, Andy.
-
Related to the question about Python UDFs: apache/datafusion-comet#957
-
Many thanks, Andy. This sounds fantastic and answers my first pressing questions. I will use these answers in our upcoming discussions on whether to adopt DataFusion and/or DataFusion Comet or to stay with pure Spark for the time being. Most probably we will need to extend Comet to read CSV, but the details are still unclear, in particular whether parallelisation is required at all for this use case.
-
We have implemented our own object store at SAP and have recently tested DataFusion, which has delivered impressive performance, particularly when compared to Spark. We are using PySpark within SAP and have come across the Apache DataFusion-Comet initiative. Since there is no dedicated discussion forum for the Comet repository, I’m reaching out here with a few questions:
For context, 80-90% of our jobs could potentially run more efficiently on a single node, but for some tasks, distributed cluster computation is essential. It would be ideal to have a system with a decision gateway that deploys jobs in an optimized manner. With DataFusion, we hope to implement such a gateway, as Spark often consumes excessive resources for simpler tasks.
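A decision gateway of this kind could be sketched roughly as follows. This is only an illustration in plain Python: the `JobSpec` fields, the 50 GiB threshold, and the engine labels are all hypothetical assumptions and are not part of DataFusion, Comet, or Spark.

```python
from dataclasses import dataclass
from enum import Enum


class Engine(Enum):
    """Hypothetical execution targets the gateway can route to."""
    SINGLE_NODE = "datafusion-single-node"
    CLUSTER = "spark-cluster"


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical job description; a real gateway would need richer metadata."""
    name: str
    estimated_input_bytes: int
    requires_cluster: bool = False  # e.g. set by the job author


# Assumed cut-off for single-node execution; would have to be tuned per environment.
SINGLE_NODE_LIMIT_BYTES = 50 * 1024**3


def choose_engine(job: JobSpec, limit_bytes: int = SINGLE_NODE_LIMIT_BYTES) -> Engine:
    """Route a job to the cluster only when it is flagged or exceeds the size limit."""
    if job.requires_cluster or job.estimated_input_bytes > limit_bytes:
        return Engine.CLUSTER
    return Engine.SINGLE_NODE
```

A real gateway would probably also consider memory pressure, join cardinality, and data locality; input size alone is just the simplest signal to start with.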
I am looking forward to your insights and ideas.
Cheers, Thorsten