[DISCUSSION] We need a Hero for datafusion-python #440

Open
alamb opened this issue Jul 26, 2023 · 17 comments · Fixed by #666

Comments

@alamb

alamb commented Jul 26, 2023

What this project could be

I think this project needs someone to take the helm who wants to build a world-class Python dataframe library and user experience. Below I argue why I think this is a compelling opportunity to build a great piece of technology and have a wide impact across the data analytics space:

What this project could be

I think this project could be one of the most widely used data analysis libraries out there. Imagine a system that offers BOTH a fast dataframe API (à la Polars, pola.rs) and first-class SQL support (à la DuckDB), both screaming fast (thanks to all the effort that goes into https://github.com/apache/arrow-datafusion), easy to plug into the ecosystem (Arrow / Parquet), and extensible (UDFs, UDAFs, etc.).

DataFusion already posts great benchmark numbers, and I will post the DataFusion 28.0.0 benchmarks when we have them.

How is this different than the mission of DataFusion?

DataFusion is a great project but is currently focused on building the core analytic engine:

DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.


This repository contains basic python bindings, but the user experience (UX) could be improved in so many ways.

The opportunity

This would be a great opportunity for someone to:

  1. Build some really cool technology
  2. Learn how to help grow an open source project and community with help and guidance from the rest of the DataFusion community
  3. Learn about analytic database technology, Arrow, etc
  4. Influence the direction of development in DataFusion
@mesejo
Contributor

mesejo commented Jul 26, 2023

I'm willing to lend a hand 😄. Are there any requirements 😅 ? Recently I've been expanding the coverage of DataFusion by ibis. In the past, I've had (minor) contributions to projects such as dask, xarray, geopandas, and eland. Let me know how I can help.

@cpcloud
Contributor

cpcloud commented Jul 27, 2023

This looks like a great idea!

I'd like to propose that ibis is the library that solves the problem of bringing a delightful and expressive DataFrame API to DataFusion with no loss of SQL functionality. You can actually mix and match ibis expressions and SQL to your heart's content.

As @mesejo mentioned, they've been making great contributions to the DataFusion backend so there's currently some momentum that we can take advantage of.

I'll be up front about it: there's still a lot of work to do, and the DataFusion backend is missing a lot of functionality.

The good news is that we've made it really easy to see what functionality is missing from any given backend using our backend support matrix app.

Anyone can take a pass at implementing the operations that have a 🚫 in the datafusion column. Some operations will be more challenging than others, and the ibis maintainers (@kszucs, @gforsyth, @jcrist and myself) are here to help.

What do you say we ... COALESCE 😉 around ibis as the DataFrame API for DataFusion?

@alamb
Author

alamb commented Jul 28, 2023

I propose we leave the decision of where to take this project and what to focus on to whatever hero(es) step forward. What I think datafusion-python needs is someone to invest the time to drive it forward, and the path to take, as in all open source projects, would be largely influenced by the contributors.

I'd like to propose that ibis is the library that solves the problem of bringing a delightful and expressive DataFrame API to DataFusion with no loss of SQL functionality. You can actually mix and match ibis expressions and SQL to your heart's content.

Thank you @cpcloud -- this is an excellent idea and it would be awesome to see the DataFusion ibis backend become more full featured.

I think a very reasonable alternative reality is that "datafusion-python" remains a thin binding on top of datafusion, and the delightful user experience comes via ibis.

What do you say we ... COALESCE 😉 around ibis as the DataFrame API for DataFusion?

That is one of the cleverest summaries I have seen in a long time. Nicely done 👏

@alamb
Author

alamb commented Jul 28, 2023

I'm willing to lend a hand 😄. Are there any requirements 😅 ? Recently I've been expanding the coverage of DataFusion by ibis. In the past, I've had (minor) contributions to projects such as dask, xarray, geopandas, and eland. Let me know how I can help.

Thank you @mesejo -- that is great. As with many projects, I think what would be most valuable here is:

  1. Reviewing PRs and encouraging more involvement
  2. Ensuring the project is easy to both use and contribute to, for example via New users guide #432

Maybe you have time to pretend you are a first-time user and figure out what is not clear or where the rough edges are. Ideally you could turn that experience into a guide to help others.

@lostmygithubaccount

just to throw out an idea related to this:

I think a very reasonable alternative reality is that "datafusion-python" remains a thin binding on top of datafusion, and the delightful user experience comes via ibis.

if we agree Ibis is a delightful dataframe API and we can close the gaps in the DataFusion backend, then you could avoid a lot of work in defining a new dataframe API by wrapping Ibis so that code looks like:

[ins] In [3]: t = datafusion.read_parquet("penguins.parquet")

[ins] In [4]: t
Out[4]:
DatabaseTable: _ibis_read_parquet_pnfkuttmizcmjk7trfkv5bhfse
  species           string
  island            string
  bill_length_mm    float64
  bill_depth_mm     float64
  flipper_length_mm int64
  body_mass_g       int64
  sex               string
  year              int64

[ins] In [5]: datafusion.options.interactive = True

[ins] In [6]: t
Out[6]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male   │  2007 │
│ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female │  2007 │
│ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female │  2007 │
│ Adelie  │ Torgersen │            nan │           nan │              NULL │        NULL │ NULL   │  2007 │
│ Adelie  │ Torgersen │           36.7 │          19.3 │               193 │        3450 │ female │  2007 │
│ Adelie  │ Torgersen │           39.3 │          20.6 │               190 │        3650 │ male   │  2007 │
│ Adelie  │ Torgersen │           38.9 │          17.8 │               181 │        3625 │ female │  2007 │
│ Adelie  │ Torgersen │           39.2 │          19.6 │               195 │        4675 │ male   │  2007 │
│ Adelie  │ Torgersen │           34.1 │          18.1 │               193 │        3475 │ NULL   │  2007 │
│ Adelie  │ Torgersen │           42.0 │          20.2 │               190 │        4250 │ NULL   │  2007 │
│ …       │ …         │              … │             … │                 … │           … │ …      │     … │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

[ins] In [7]: t.group_by(["species", "island"]).agg(datafusion._.count())
Out[7]:
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ species   ┃ island    ┃ CountStar(_ibis_read_parquet_pnfkuttmizcmjk7trfkv5bhfse) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string    │ string    │ int64                                                    │
├───────────┼───────────┼──────────────────────────────────────────────────────────┤
│ Adelie    │ Biscoe    │                                                       44 │
│ Adelie    │ Torgersen │                                                       52 │
│ Adelie    │ Dream     │                                                       56 │
│ Chinstrap │ Dream     │                                                       68 │
│ Gentoo    │ Biscoe    │                                                      124 │
└───────────┴───────────┴──────────────────────────────────────────────────────────┘

[ins] In [8]: t.group_by(["species", "island"]).agg(datafusion._.count().name("count"))
Out[8]:
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species   ┃ island    ┃ count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ string    │ string    │ int64 │
├───────────┼───────────┼───────┤
│ Adelie    │ Biscoe    │    44 │
│ Adelie    │ Torgersen │    52 │
│ Chinstrap │ Dream     │    68 │
│ Gentoo    │ Biscoe    │   124 │
│ Adelie    │ Dream     │    56 │
└───────────┴───────────┴───────┘

@mesejo
Contributor

mesejo commented Aug 6, 2023

@alamb Thanks for the feedback

Maybe you have time to pretend you are a first-time user and figure out what is not clear or where the rough edges are. Ideally you could turn that experience into a guide to help others.

I opened a PR with a draft for the User Guide 😃. While I was writing the guide, I noticed two issues that have a huge impact on the UX and are simple to solve:

  1. The IDE cannot provide hints (or autocompletion) because there is no typing information.
  2. There are no examples of how to use each method (or function)

For solving 1. we could follow the PyO3 guide and add information in .pyi files.

For 2. one alternative is to wrap the methods and functions in Python and add docstrings to them (similar to what polars does, see this for example).

What are your thoughts?

@alamb
Author

alamb commented Aug 7, 2023

For solving 1. we could follow the PyO3 guide and add information in .pyi files.

That seems like a very good idea to me

For 2. one alternative is to wrap the methods and functions in Python and add docstrings to them (similar to what polars do, see this for example).

I think this is likewise a great idea

Thank you @mesejo

@kylebarron
Contributor

kylebarron commented Aug 7, 2023

Just a note that with manual .pyi files you have the endless problem of ensuring that the .pyi files and code match up correctly. Wrapping every Rust function in a pure-Python function, as polars does, works but also incurs a ton of overhead (edit: development overhead, not runtime performance overhead). The long-term solution is for pyo3 to emit Python type files automatically, as wasm-bindgen does for TypeScript, but that's likely far off.
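One way to soften the drift problem is a check that compares the stub against the runtime module, for example in CI. The sketch below is only an illustration: `demo`, `STUB`, and the function names in them are made up, and it only compares plain positional parameter names of top-level functions. For compiled functions, `inspect.signature` only works when the extension exposes a text signature, so that is an assumption too.

```python
import ast
import inspect
import types


def stub_signatures(stub_source: str) -> dict:
    """Map each top-level function in a .pyi source to its parameter names."""
    tree = ast.parse(stub_source)
    return {
        node.name: [arg.arg for arg in node.args.args]
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


def find_drift(module, stub_source: str) -> list:
    """Return human-readable mismatches between a module and its stub."""
    problems = []
    for name, stub_params in stub_signatures(stub_source).items():
        func = getattr(module, name, None)
        if func is None:
            problems.append(f"{name}: in stub but missing at runtime")
            continue
        runtime_params = list(inspect.signature(func).parameters)
        if runtime_params != stub_params:
            problems.append(
                f"{name}: stub {stub_params} != runtime {runtime_params}"
            )
    return problems


def _read_csv(path, has_header=True):
    """Hypothetical runtime function whose signature has drifted."""


# Hypothetical module standing in for the compiled extension.
demo = types.ModuleType("demo")
demo.read_csv = _read_csv

# Hypothetical stub contents; a real check would read the .pyi file.
STUB = '''
def read_csv(path: str, delimiter: str = ",") -> None: ...
def read_parquet(path: str) -> None: ...
'''

for problem in find_drift(demo, STUB):
    print(problem)
```

This catches renamed or missing functions and parameters, but not wrong type annotations, so it reduces the maintenance burden rather than removing it.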

@devinjdangelo

devinjdangelo commented Aug 18, 2023

I’m late to this discussion (and new to this project in general), but the contributions I’ve been focused on over the past month or so have been aimed at closing some of the gaps I see as a heavy Python user with a data science / engineering background. For ETL use cases in particular, it needs to be easy to move and transform data between various formats and object stores while using every available core to the fullest. Default options need to be well tuned, since most of these users, in my opinion, won’t give DataFusion a second look if they run their job and it is much slower than polars or whatever tool they use currently.

I haven’t gotten to actually looking much at the python interface yet, but it is on my list.

I am very much on board with the vision you describe @alamb.

@magarick

Hi everyone. I'm happy to help out with this. I think it might be a good idea to get a sense of what people think this should ultimately look like, as well as what features they think a good DataFrame library should have. To that end, I've started issue #462, which will hopefully help gather ideas and fodder for documentation.

@mesejo
Contributor

mesejo commented Sep 6, 2023

Folks! I've created some issues to tackle the missing functions in the Python bindings.

These are a perfect fit for a good first issue, so contributions are more than welcome. (@alamb perhaps we could label the issues as such and promote them on Twitter to increase the involvement with the project?)

@alamb
Author

alamb commented Sep 7, 2023

Thanks @mesejo -- I marked the tickets as good first issue and posted a tweet: https://twitter.com/andrewlamb1111/status/1699827809462440353

@woxiaosa

woxiaosa commented Jan 4, 2024

For solving 1. we could follow the PyO3 guide and add information in .pyi files.

That seems like a very good idea to me

For 2. one alternative is to wrap the methods and functions in Python and add docstrings to them (similar to what polars do, see this for example).

I think this is likewise a great idea

Thank you @mesejo

Is there anyone currently adding pyi files for datafusion-python? I have experience in this area and I would like to be involved in this work

@alamb
Author

alamb commented Jan 5, 2024

Is there anyone currently adding pyi files for datafusion-python? I have experience in this area and I would like to be involved in this work

Thanks @woxiaosa -- I do not know of anyone currently adding pyi files

@dlr2

dlr2 commented Jan 31, 2024

I am also late to this, but as I am trying to evaluate datafusion (Python) I can give some of my input. I am sure that folks know that the documentation is scant, with most API functions having no more than the method name and args (auto extracted from sphinx).

My next idea was to test the SQL vs native expression filtering. Got the SQL to work, but I cannot see how to use an 'and' expr/function. As this is a reserved word I saw no way to apply it or import it. Would the native expr filtering be faster than the equivalent SQL?

So yes, complete examples (showing all the imports, etc) for all the functions and expressions would be great. Hope this is useful
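On the 'and' question above: Python does not allow the `and` keyword to be overloaded, so expression libraries usually overload the `&` operator instead. As far as I know, datafusion-python's `Expr` follows the same convention, but treat that as an assumption to check against the API docs. The `Expr` and `col` below are a toy stand-in written for this illustration, not the library's classes:

```python
class Expr:
    """Toy expression node; a hypothetical stand-in, not datafusion's Expr."""

    def __init__(self, text: str) -> None:
        self.text = text

    @staticmethod
    def _render(value) -> str:
        # Render nested expressions by their text and plain values by repr().
        return value.text if isinstance(value, Expr) else repr(value)

    def __gt__(self, other) -> "Expr":
        return Expr(f"({self.text} > {self._render(other)})")

    def __lt__(self, other) -> "Expr":
        return Expr(f"({self.text} < {self._render(other)})")

    def __and__(self, other: "Expr") -> "Expr":
        # `&` is overloadable; the `and` keyword is not.
        return Expr(f"({self.text} AND {other.text})")


def col(name: str) -> Expr:
    """Build a column reference expression."""
    return Expr(name)


# The parentheses are required: `&` binds tighter than `>` and `<`.
predicate = (col("a") > 1) & (col("b") < 10)
print(predicate.text)  # prints "((a > 1) AND (b < 10))"
```

Forgetting the parentheses is the usual pitfall with this style, since `col("a") > 1 & col("b") < 10` is parsed as a chained comparison around `1 & col("b")`.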

@alamb
Author

alamb commented May 13, 2024

I think github got a little excited about closing this

@lostmygithubaccount

This PR does not close #440 but it helps to address one part of it.

somebody at GitHub is going to use this as evidence for LLM-based issue closing instead of the current rules
