feat: add "lazy-only" level of support #566
I'm a fan of Ibis and use it regularly, mainly together with https://github.com/binste/dbt-ibis, and I find it exciting to see that you're looking at it as a potential layer around DuckDB. However, given that one of the benefits of Narwhals is low overhead, I just wanted to note that all the type validation and abstractions in Ibis do add overhead, which can add up. It might not be noticeable when working interactively with larger amounts of data, but I've had use cases in web applications where I had to rewrite code back from Ibis to pure DuckDB or pandas because of this.

Here's a quick example, loosely based on this DuckDB-Ibis example. Versions: Ibis 9.1, DuckDB 1.0.

```python
import ibis

# Create some example database
con = ibis.connect("duckdb://penguins.ddb")
con.create_table(
    "penguins", ibis.examples.penguins.fetch().to_pyarrow(), overwrite=True
)
```

```python
%%timeit
# Reconnect to the persisted database (dropping temp tables)
con = ibis.connect("duckdb://penguins.ddb")
penguins = con.table("penguins")
penguins = (
    penguins.filter((penguins.species == "Gentoo") & (penguins.body_mass_g > 6000))
    .mutate(bill_length_cm=penguins.bill_length_mm / 10)
    .select(
        "species",
        "island",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
        "sex",
        "year",
        "bill_length_cm",
    )
)
penguins_arrow = penguins.to_pyarrow()
```

```
19.1 ms ± 482 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Same in pure DuckDB (imports added here for completeness):

```python
import duckdb
from duckdb import ColumnExpression, ConstantExpression
```

```python
%%timeit
con = duckdb.connect("penguins.ddb")
penguins = con.table("penguins")
penguins = (
    penguins.filter(
        (ColumnExpression("species") == ConstantExpression("Gentoo"))
        & (ColumnExpression("body_mass_g") > ConstantExpression(6000))
    )
    .project(
        *[ColumnExpression(name) for name in penguins.columns],
        (ColumnExpression("bill_length_mm") / ConstantExpression(10)).alias(
            "bill_length_cm"
        ),
    )
    .select(
        "species",
        "island",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
        "sex",
        "year",
        "bill_length_cm",
    )
)
penguins_arrow = penguins.arrow()
```

```
1.5 ms ± 9.11 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

I've had examples with many more steps and wider tables where the difference was > 100 ms. For wide tables there will be some improvements in Ibis 9.2 (see ibis-project/ibis#9111), but some overhead might just be inherent. Just thought you might appreciate the note that it's worth benchmarking what you want to achieve first, in case you didn't already have it on your list anyway :)
Hope you don't mind me commenting here - this issue was linked in our issue tracker. Just want to offer a small clarification on the overhead point:

Ibis's expressions definitely have some overhead, but that overhead scales relative to the size of your query, not your data. We're doing some work to reduce it in certain cases (as you noted, some operations scaled poorly on wide tables), but for most analytics use cases the overhead (10-100 ms in your examples above) is negligible compared to the execution time of the query. Ibis was designed to perform well for data analytics/engineering tasks, with operations on medium-to-large data. For small data the overhead will be more noticeable. It definitely wasn't written to compete with […]
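The "query size, not data size" point can be illustrated with a toy deferred-expression sketch (purely illustrative - this is not Ibis's implementation): building the expression tree costs time proportional to the number of operations in the query, regardless of how many rows it will eventually run over.

```python
# Toy stand-in for a deferred expression system: operations are recorded,
# not executed, so construction cost depends only on the query's size.
class ToyExpr:
    def __init__(self, ops=()):
        self.ops = ops  # tuple of (operation-name, payload) pairs

    def filter(self, cond):
        return ToyExpr(self.ops + (("filter", cond),))

    def mutate(self, **kwargs):
        return ToyExpr(self.ops + (("mutate", kwargs),))


# Building this query does the same amount of work whether the underlying
# table has ten rows or ten billion - no data is touched until execution.
query = ToyExpr().filter("species == 'Gentoo'").mutate(
    bill_length_cm="bill_length_mm / 10"
)
print(len(query.ops))  # → 2
```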
Thanks @jcrist! That summarises better what I was trying to convey, and hopefully helps the Narwhals maintainers think this one through :)
Thanks both for the comments! 🙏

The objective of Narwhals is to be as close as possible to a zero-cost abstraction, whilst keeping everything pure-Python and without required dependencies.

At a minimum, if we were to consider supporting DuckDB by going via Ibis, we would require that this be done without any non-lightweight dependencies and with negligible overhead. I understand that we're not yet in a situation where this is possible, but I was keen to get the conversation started.

Anyway, it's nice to see this issue bringing together maintainers of different projects to discuss things 🤗
Things are picking up here. It's involving a fair few refactors, which is probably a good thing. If anyone's interested in contributing support for the other libraries mentioned, I'd suggest waiting until Dask support is complete enough (let's say, able to execute TPC-H queries 1-7) - by the time that's happened, adding other backends should be much simpler.
This is really work-in-progress, and not at all ready for prod, but we're making progress in #1725. Running the benchmark above:

```python
from __future__ import annotations

import duckdb
from duckdb import ColumnExpression, ConstantExpression

import narwhals as nw


def native(penguins):
    penguins = (
        penguins.filter(
            (ColumnExpression("species") == ConstantExpression("Gentoo"))
            & (ColumnExpression("body_mass_g") > ConstantExpression(6000))
        )
        .project(
            *[ColumnExpression(name) for name in penguins.columns],
            (ColumnExpression("bill_length_mm") / ConstantExpression(10)).alias(
                "bill_length_cm"
            ),
        )
        .select(
            "species",
            "island",
            "bill_depth_mm",
            "flipper_length_mm",
            "body_mass_g",
            "sex",
            "bill_length_cm",
        )
    )
    return penguins.arrow()


def nw_soln(penguins):
    df = nw.from_native(penguins)
    return (
        df.filter(nw.col("species") == "Gentoo", nw.col("body_mass_g") > 6000)
        .with_columns(bill_length_cm=nw.col("bill_length_mm") / 10)
        .select(
            "species",
            "island",
            "bill_depth_mm",
            "flipper_length_mm",
            "body_mass_g",
            "sex",
            "bill_length_cm",
        )
        .collect()
        .to_native()
    )


print(native(duckdb.read_parquet("../scratch/penguins.parquet")))
print()
print(nw_soln(duckdb.read_parquet("../scratch/penguins.parquet")))
```

I see:

```
In [2]: results = %timeit -o native(duckdb.read_parquet("../scratch/penguins.parquet"))
2.23 ms ± 190 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: results.best
Out[3]: 0.0019655416079986028

In [4]: results = %timeit -o nw_soln(duckdb.read_parquet("../scratch/penguins.parquet"))
3.12 ms ± 221 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: results.best
Out[5]: 0.0026353435700002593
```

(For reference, Ibis is around 9-10 ms here, but as noted above that's an overhead which disappears for larger datasets.)

There's still some overhead, and I'll see how to reduce it, but I think it looks encouraging - the DuckDB relational API is a lot nicer and more complete than I was expecting. The main thing we're missing in […]
If we look at going via Ibis's `to_duckdb` (once ibis-project/ibis#9629 is addressed), or directly via their relational API, then it doesn't seem too hard to support the `narwhals.LazyFrame` API without running into footguns. We may need to disallow operations which operate on multiple columns' row orders independently (the alternative is doing a join on row order behind the scenes, which I'm not too keen on, as we'd be making an expensive operation look cheap).

Each one can then have an associated eager class, which is what things get transformed to if you call `.collect`.

We can start with this, and possibly consider expanding support later. As they say - "in open source, 'no' is temporary, but 'yes' is forever" - doubly so in a library like Narwhals with a stable API policy 😄
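The row-order footgun mentioned above can be sketched in plain Python (illustrative only - none of this is Narwhals API): sorting two columns independently is cheap on an ordered, eager backend, but has no direct equivalent on a SQL backend, where a relation has no inherent row order.

```python
# Two columns of the same toy table.
a = [3, 1, 2]
b = ["x", "y", "z"]

# An ordered, eager backend can sort each column independently and cheaply:
sorted_a = sorted(a)                # ascending
sorted_b = sorted(b, reverse=True)  # descending

# But the resulting rows no longer correspond to any row of the input table:
rows = list(zip(sorted_a, sorted_b))
print(rows)  # → [(1, 'z'), (2, 'y'), (3, 'x')]

# A SQL backend would have to materialise an explicit row-number column and
# join on it to reproduce this - an expensive operation hidden behind
# cheap-looking syntax, which is exactly the kind of footgun worth disallowing.
```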
Furthermore, we'd like to make it easier to extend Narwhals, so that if library authors disagree with us and want to add complexity which we wouldn't be happy to maintain, they're free to extend Narwhals themselves - all they'd need to do is implement a few dunder methods.
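As a sketch of what such an extension hook could look like (the class and method names below are hypothetical stand-ins, not the actual Narwhals protocol): a third-party object advertises compatibility by implementing a dunder that hands the wrapping layer a compliant object to operate on.

```python
# Hypothetical sketch: `__narwhals_dataframe__` here is a stand-in for
# whichever dunder(s) the real protocol specifies.
class MyCompliantFrame:
    """The wrapper that Narwhals-style code would operate on."""

    def __init__(self, data: dict):
        self._data = data

    def column_names(self):
        return list(self._data)


class MyDataFrame:
    """A third-party 'native' frame opting in to the protocol."""

    def __init__(self, data: dict):
        self._data = data

    def __narwhals_dataframe__(self):
        # Called by the wrapping layer to obtain the compliant object.
        return MyCompliantFrame(self._data)


native = MyDataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
compliant = native.__narwhals_dataframe__()
print(compliant.column_names())  # → ['a', 'b']
```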