Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSS] Document criteria for adding new features / what belongs in core DataFusion (e.g. sql syntax, functions, etc) #12357

Open
alamb opened this issue Sep 6, 2024 · 20 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Sep 6, 2024

Is your feature request related to a problem or challenge?

DataFuson is growing by almost all measures: community πŸ€— , features πŸͺΆ , and codebase size βœ… which is good πŸŽ‰ However, this growth is causing challenges such as:

  1. Lengthy review cycles (especially for new features). For example the PR for lateral subqueries took 5 weeks to review and merge
  2. PRs that are written but then not merged as they seem to be too large in scope (e.g. hugging face from @xinlifoobar , FlightSQLDriver from @ccciudatu, etc)
  3. Uncertainty on feature scope -- for example, should we be adding all the (very cool) DuckDB SQL extensions / aggregates to make the default SQL engine as easy as possible or should those be implemented extension packages?

As described in the Design Goals, it is important for DataFusion to:

  1. Work β€œout of the box”: Provide a very fast, world class query engine with minimal setup or required configuration.
  2. Customizable everything: All behavior should be customizable by implementing traits.

However, this description doesn't offer any specific criteria about which features should be in the core (to work "out of the box") and which should be implemented as extensions

I am worried that if we take all possiblely useful features, the DataFusion core will become unmanageble / unmaintainable. Already we are struggling with review capacity (it takes days / weeks to review new feautre PRs)

Describe the solution you'd like

I would like a clearly articulated set of criteria of when features should be added to the core vs when they should be in downstream projects / crates built with the extension APIs

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added enhancement New feature or request documentation Improvements or additions to documentation labels Sep 6, 2024
@alamb
Copy link
Contributor Author

alamb commented Sep 6, 2024

It seems to me we also haven't documented anywhere the "the built in SQL dialect tries to follow postgresql semantics when possible"

@alamb
Copy link
Contributor Author

alamb commented Sep 6, 2024

Some ideas for potential criteria

  1. Bug fixes for existing features: yes
  2. Performance improvements to existing features: yes
  3. Functional improvements to existing features: Yes
  4. New functionality that is part of the "standard sql" (aka postgres dialect): Yes
  5. New APIs for extending various parts of DataFusion: Yes
  6. New functions that aren't part of the "standard sql" (aka postgres dialect): No (rationale being they can be implemented as extensions / packages)
  7. New data sources (e.g. support for ORC) No (rationale being they can be implemented using extensions easily)

@jayzhan211
Copy link
Contributor

jayzhan211 commented Sep 6, 2024

Uncertainty on feature scope -- for example, should we be adding https://github.com/apache/datafusion/issues/12254to make the default SQL engine as easy as possible or should those be implemented extension packages?

Great to start the discussion of this. I have questions about this too. I think we don't really need to have all the functions in datafusion. I think core functions like count, and sum without doubt should have one in the datafusion core. The functions that require Trait implementation are also reasonable to keep them with, to showcase that the Trait is extensible enough. Others are nice to have but they could be implemented by downstream projects themselves. At least we should focus on more important things such as performance and extensibility for now. See Roadmap #11442

@cisaacson
Copy link
Contributor

@alamb I fully agree with your recommendation. It maintains the power of DataFusion while avoiding too much complexity. In my mind (and I think the project), DataFusion is first and foremost an extensible query engine, so that many new things can be implemented as a result. That purpose means the core features should be limited to those things that enable extensibility, rather than trying to bundle it all into DataFusion itself.

@findepi
Copy link
Member

findepi commented Sep 6, 2024

thanks for starting this discussion @alamb. this was a very clearly missing clarification. let's get is discussed!

DataFusion is first and foremost an extensible query engine, so that many new things can be implemented as a result.

@cisaacson very good point!
this might be what brings most of the maintainers to the project and funds their time.

Work β€œout of the box”: Provide a very fast, world class query engine with minimal setup or required configuration.

This implies reasonable collection of functions to be bundled to make the engine useful for the end-user.

For people building things on top of DataFusion, core performance & extensibility are must-haves. If we want DataFusion to be "just" extensible query engine, we can stop here.

For users, functionality (broadly speaking) and out of the box experience are must-haves.
If we want to have users, we need to understand what they want, what will make them happy.
Users may interact with DF via our CLI, our next-gen CLI #11979 or projects that build on top of DF (but not necessarily by repackaging it) like Ibis (https://ibis-project.org/backends/datafusion).
My personal opinion is that we want users & this will help this project grow and indirectly attract more people building on top of it.

Performance improvements to existing features: yes

This is an obviously good call. In practice we will face trade-offs like first-row latency vs throughput, that depends on the intended typical use-case. Hopefully this is not too often.

@2010YOUY01
Copy link
Contributor

Some ideas for potential criteria

  1. Bug fixes for existing features: yes
  2. Performance improvements to existing features: yes
  3. Functional improvements to existing features: Yes
  4. New functionality that is part of the "standard sql" (aka postgres dialect): Yes
  5. New APIs for extending various parts of DataFusion: Yes
  6. New functions that aren't part of the "standard sql" (aka postgres dialect): No (rationale being they can be implemented as extensions / packages)
  7. New data sources (e.g. support for ORC) No (rationale being they can be implemented using extensions easily)

I think only maintain SQL features which are also built-in features in PostgreSQL is a good idea (also a very clear criteria)
Postgres should have a smaller feature set than DuckDB/SparkSQL, though many DuckDB functions are really good for UX, but the larger size make it hard to maintain in DataFusion core

@jayzhan211
Copy link
Contributor

jayzhan211 commented Sep 7, 2024

I think the reason why DuckDB is also taken into consideration is that when we start the array function, we found that OLAP style db is a much more suitable choice to follow than Postgres. #6855

Therefore, I'm not sure if we only stick with Postgres Only is a good idea. We might need to discuss it case by case.

@alamb
Copy link
Contributor Author

alamb commented Sep 7, 2024

I think the reason why DuckDB is also taken into consideration is that when we start the array function, we found that OLAP style db is a much more suitable choice to follow than Postgres. #6855

I agree -- this is a good point. There are certain feature sets (VARIANT is another more recent one that comes to mind) where there really isn't a good postgres feature set to follow, and if we want to add such features into DataFusion then we would need to find another standard.

Therefore, I'm not sure if we only stick with Postgres Only is a good idea. We might need to discuss it case by case.

I think it would be an excellent idea to make sure we don't end up with some "hard and fast rule that must always be followed" -- ensuring we can continue to evaluate each idea on a case by case basis is a great point. Maybe in these cases the "bar" is higher (like a good amount of the community thinks it is an important and widely applicable feature πŸ€” )

@jonahgao
Copy link
Member

jonahgao commented Sep 7, 2024

I agree with keeping the DataFusion core simple and focused.

I am thinking whether we should maintain an index service or something like VSCode marketplace to showcase third-party extensions developed by other users and make it easy for users to find the extensions they need. These extensions display different properties based on different types, such as TableProvider or UDF. We may need to do some work to make integrating extensions into DataFusion easier.

@phillipleblanc
Copy link
Contributor

phillipleblanc commented Sep 7, 2024

I like the analogy that DataFusion is to query engines what LLVM is to programming languages. (I think I heard Andrew say that once?) Although the analogy isn't perfect, because you can use DataFusion out of the box for a great SQL query experience whereas LLVM (to my knowledge) requires writing a non-trivial amount of code to integrate with it.

Actually I think its because DataFusion has such a great out of box experience, that people want to naturally add to it to make it even better.

For people building things on top of DataFusion, core performance & extensibility are must-haves. If we want DataFusion to be "just" extensible query engine, we can stop here.

For users, functionality (broadly speaking) and out of the box experience are must-haves.
If we want to have users, we need to understand what they want, what will make them happy.

This is an important distinction, and where we need to decide if we want to be more like LLVM (i.e. focus on people building things on top of DataFusion) or something that attracts users directly. I don't think that those are mutally exclusive (i.e. most users probably are people building on top of DataFusion) - but I do think it makes sense to focus more on the core part of what makes DataFusion great as mentioned above.

I am thinking whether we should maintain an index service or something like VSCode marketplace to showcase third-party extensions developed by other users and make it easy for users to find the extensions they need. These extensions display different properties based on different types, such as TableProvider or UDF. We may need to do some work to make integrating extensions into DataFusion easier.

Yes, I think part of the solution here is to make it very easy to discover extensions that add to the base DataFusion functionality. I think part of why its tempting to add new features to DataFusion core is that it makes it more discoverable by default/provides a natural coordination point for implementing a set of functionality.

As a concrete example, when I was first integrating with DataFusion for my project I needed the ability to translate DataFusion expressions back into raw SQL strings to implement a TableProvider. I found the unparser module that implements this functionality by looking through the code in the DataFusion repo. Had I not found it, I probably would have gone on to implement it myself. But I could make the argument that the unparser module doesn't really need to be in DataFusion core, it could just be an extension or in its own crate. (Actually in that case, the unparser was already in a separate crate initially then it was brought into the core.) Having a natural way to discover functionality like the unparser without having to be in the core repo is important to solve - otherwise there will always be an incentive to try to get it into the core. Or we end up with multiple people implementing the same thing instead of working on it together - which almost happened in the unparser example.

@alamb
Copy link
Contributor Author

alamb commented Sep 10, 2024

This is an important distinction, and where we need to decide if we want to be more like LLVM (i.e. focus on people building things on top of DataFusion) or something that attracts users directly. I don't think that those are mutally exclusive (i.e. most users probably are people building on top of DataFusion) - but I do think it makes sense to focus more on the core part of what makes DataFusion great as mentioned above.

I agree -- this idea is somewhat mentioned in https://datafusion.apache.org/user-guide/faq.html#how-does-datafusion-compare-with-xyz as well:

Targeted at developers, rather than end users / data scientists.

Yes, I think part of the solution here is to make it very easy to discover extensions that add to the base DataFusion functionality. I think part of why its tempting to add new features to DataFusion core is that it makes it more discoverable by default/provides a natural coordination point for implementing a set of functionality.

I am thinking whether we should maintain an index service or something like VSCode marketplace to showcase third-party extensions developed by other users and make it easy for users to find the extensions they need.

I agree with @phillipleblanc and @jonahgao -- here is a proposal to try and make it easier to discover extensions:

It isn't quite as easy as VSCode marketplace (or the newly announced DuckDB community extensions: https://community-extensions.duckdb.org/) but it is a start.

I also very much hope that the https://github.com/datafusion-contrib/datafusion-tui project @matthewmturner and I are working on will become an example / easy to start from place for pre-cooked integrations which will help with discoverability. We still have a ways to go but I am feeling bullish.

@ozankabak
Copy link
Contributor

ozankabak commented Sep 11, 2024

Thank you for starting this discussion. I really agree with this concise statement:

keeping the DataFusion core simple and focused.

When we first joined the project (almost two years ago now), it took us some time to internalize/digest this approach as our first instinct was to contribute as much as we can upstream. However, I can safely say that following this guideline helped us with our engineering too -- it forces one to think about the right boundaries between components, what belongs to the core, etc.

@kszlim
Copy link

kszlim commented Sep 12, 2024

One vote here for the other use case. I'd like datafusion to be usable as a single node query engine (alongside a nice dataframe api). This is in works within the datafusion-python bindings, but I'd personally love for this use case to gain as much priority as datafusion as a library to build other db products on top of.

I really think with a combination of really strong python bindings (and ensuring that all extension points are also appropriately exposed to python), #4285, and a lot of work into making the docs and the python bindings as nice as polars. Datafusion could become the go to solution for ETL/OLAP/ML/data engineering/etc. use cases.

DataFusion has a lot of really excellent foundational engineering. How it's used by so many downstream DB engines attests strongly to that. I think it's a real shame that it isn't quite as suitable for the role that pandas/dask/polars/duckdb currently occupies. This isn't due to anything lacking in the query engine, but the overall user experience for a direct user isn't quite as solid (as opposed to someone using it as a library).

@alamb
Copy link
Contributor Author

alamb commented Sep 12, 2024

DataFusion has a lot of really excellent foundational engineering. How it's used by so many downstream DB engines attests strongly to that. I think it's a real shame that it isn't quite as suitable for the role that pandas/dask/polars/duckdb currently occupies. This isn't due to anything lacking in the query engine, but the overall user experience for a direct user isn't quite as solid (as opposed to someone using it as a library).

Thank you @kszlim -- This is well stated, and I think this is one of the core tensions that has existed in the project from the early days

One way to go is as you suggest and try and make datafusion the superset of all that is good about polars (python dataframes) and duckdb (sql). I worry that this will result in an even larger library that will never be good as either.

Another potential way is to keep the core focused on fundamentals and work to provide open source alternatives to those other libraries built on datafusion. It is my not-so-secret goal with the following discussions:

I am hopeing to see datafusion-python (or maybe a library built on datafusion-python) and dft evolve into delightful end user experiences.

The benefit if keeping the core more focused is that it would make it easier to embed and have more usecases, thus drawing more users and thus contributors back.

@timsaucer
Copy link
Contributor

I really think with a combination of really strong python bindings (and ensuring that all extension points are also appropriately exposed to python), #4285, and a lot of work into making the docs and the python bindings as nice as polars.

I feel like we've made a ton of progress on this in datafusion-python 40 and 41. As someone who is also using datafusion-python in my project, I can already feel the huge usability improvements that make my day to day work more enjoyable. Now, I'm probably biased since I am focusing on building those as I need them for my projects. But the type hinting, simpler apis, html rendering in notebooks, and rust udfs in python all have made a really different experience from when I first started to use it.

The point I'm still struggling with right now is the extension points and how those can/should fit into the python bindings. There are some parts that are trivially easy to do and some parts that are not supported. I should probably open an issue to find out what all of the extensions people would like to see in the python bindings.

That's a bit of an aside from the central discussion here. My thoughts on the core question is much in line with what @alamb suggests above about supporting core features and a minimal set of extensions to demonstrate the usability.

@alamb
Copy link
Contributor Author

alamb commented Sep 12, 2024

I feel like we've made a ton of progress on this in datafusion-python 40 and 41. As someone who is also using datafusion-python in my project, I can already feel the huge usability improvements that make my day to day work more enjoyable. Now, I'm probably biased since I am focusing on building those as I need them for my projects. But the type hinting, simpler apis, html rendering in notebooks, and rust udfs in python all have made a really different experience from when I first started to use it.

I think all great software is created by someone who is in some way building it for themselves and has an intuitive understanding of what is needed. I am very glad you have started to help craft datafusion-python this way

@Omega359
Copy link
Contributor

One vote here for the other use case. I'd like datafusion to be usable as a single node query engine (alongside a nice dataframe api).

This is my use case - datafusion is an embedded query engine which I use via it's dataframe api. I have a very small set of changes that I've made to datafusion in a branch but for the most part I use it as it is.

@alamb
Copy link
Contributor Author

alamb commented Sep 17, 2024

FYI we created https://github.com/datafusion-contrib/datafusion-functions-extra as a home for extra functions to try and organize our efforts to make new functions outside the core of datafusion

See #12254 (comment) for more details

@alamb
Copy link
Contributor Author

alamb commented Sep 18, 2024

In case it isn't obvious, one of my goals with encouraging / setting up other repositories is to provide an outlet for contributions that isn't the datafusion core

I don't want the answer to be "no we don't want them" -- I just think the answer can't be "put them in the datafusion core" for everything (mostly to keep the maintenance of the project manageable)

@cisaacson
Copy link
Contributor

@alamb This is a great way to do this, it allows the core of DataFusion to keep its focus. I support this approach, and it allows other tools to be added that rely on DataFusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

No branches or pull requests