
feat: add "lazy-only" level of support #566

Open
MarcoGorelli opened this issue Jul 21, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@MarcoGorelli
Member

MarcoGorelli commented Jul 21, 2024

If we look at:

  • dask.dataframe
  • ibis (personally I'd prefer to wait for them to complete the removal of pandas+pyarrow as required dependencies for all backends)
  • duckdb (either via Ibis, if feat: to_duckdb ibis-project/ibis#9629 is addressed, or directly via their relational API)

then it doesn't seem too hard to support the narwhals.LazyFrame API without running into footguns. We may need to disallow operations which operate on multiple columns' row orders independently (the alternative is doing a join on row order behind the scenes, which I'm not too keen on, as we'd be making an expensive operation look cheap).
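
To make that concrete, here's a hypothetical illustration (in Polars syntax, which does permit this) of the kind of operation that would be disallowed:

import polars as pl

# "a" is reordered on its own while "b" keeps its original order. On an
# unordered lazy backend, supporting this would need a hidden join on row
# position - an expensive operation made to look cheap.
lf = pl.LazyFrame({"a": [3, 1, 2], "b": [10, 20, 30]})
lf.select(pl.col("a").sort(), pl.col("b")).collect()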

Each one can then have an associated eager class, which is what things get transformed to if you call .collect (see the sketch after the list):

  • dask goes to pandas
  • ibis goes to pyarrow
  • duckdb goes to pyarrow
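
As a sketch of what this could look like from the user's side (hypothetical: it assumes Dask support lands as described):

import dask.dataframe as dd
import narwhals as nw

# A lazy, Dask-backed frame; .collect() materialises it, handing back a
# pandas-backed eager frame per the mapping above.
lf = nw.from_native(dd.from_dict({"a": [1, 2, 3]}, npartitions=1))
df = lf.select(nw.col("a") * 2).collect()
print(df.to_native())  # a pandas DataFrame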

We can start with this, and possibly consider expanding support later. As they say - "in open source, 'no' is temporary, but 'yes' is forever" - doubly so in a library like Narwhals with a stable API policy 😄

Furthermore, we'd like to make it easier to extend Narwhals, so that if library authors disagree with us and want to add complexity which we wouldn't be happy maintaining, they're free to extend Narwhals themselves - all they'd need to do is implement a few dunder methods.
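
A rough sketch of that extension mechanism (the dunder name below is an assumption, shown only to illustrate the shape of the protocol):

class MyLazyFrame:
    """A third-party lazy frame opting in to Narwhals support."""

    def __narwhals_lazyframe__(self):
        # Assumed hook: return an object implementing the compliant-level
        # LazyFrame protocol that Narwhals' translation layer expects.
        return self

# nw.from_native(MyLazyFrame()) would then wrap it like any built-in backend.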

@MarcoGorelli added the enhancement (New feature or request) label Jul 21, 2024
@binste

binste commented Jul 22, 2024

I'm a fan of Ibis and use it regularly, mainly together with https://github.com/binste/dbt-ibis, and I find it exciting to see that you're looking at it as a potential layer around DuckDB.

However, given that one of the benefits of Narwhals is low overhead, I just wanted to note that all the type validation and abstractions in Ibis do add overhead which can add up. It might not be noticeable when working interactively with larger amounts of data, but I had use cases in web applications where I had to rewrite code back from Ibis to pure DuckDB or pandas because of this.

Here's a quick example, loosely based on this DuckDB-Ibis example. Ibis 9.1, DuckDB 1.0.

import ibis

# Create some example database
con = ibis.connect("duckdb://penguins.ddb")
con.create_table(
    "penguins", ibis.examples.penguins.fetch().to_pyarrow(), overwrite = True
)

%%timeit
# reconnect to the persisted database (dropping temp tables)
con = ibis.connect("duckdb://penguins.ddb")
penguins = con.table("penguins")
penguins = (
    penguins.filter((penguins.species == "Gentoo") & (penguins.body_mass_g > 6000))
    .mutate(bill_length_cm=penguins.bill_length_mm / 10)
    .select(
        "species",
        "island",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
        "sex",
        "year",
        "bill_length_cm",
    )
)
penguins_arrow = penguins.to_pyarrow()

19.1 ms ± 482 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Same in pure DuckDB (assumes import duckdb and from duckdb import ColumnExpression, ConstantExpression have been run beforehand):

%%timeit
con = duckdb.connect("penguins.ddb")
penguins = con.table("penguins")
penguins = (
    penguins.filter(
        (ColumnExpression("species") == ConstantExpression("Gentoo"))
        & (ColumnExpression("body_mass_g") > ConstantExpression(6000))
    )
    .project(
        *[ColumnExpression(name) for name in penguins.columns],
        (ColumnExpression("bill_length_mm") / ConstantExpression(10)).alias(
            "bill_length_cm"
        )
    )
    .select(
        "species",
        "island",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
        "sex",
        "year",
        "bill_length_cm",
    )
)
penguins_arrow = penguins.arrow()

1.5 ms ± 9.11 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I had examples with many more steps and wider tables where the difference was > 100ms. For wide tables, there will be some improvements in Ibis 9.2, see ibis-project/ibis#9111, but some overhead might just be inherent.

Just thought you might appreciate the note that it's worth benchmarking what you want to achieve first, in case it wasn't already on your list anyway :)

@jcrist

jcrist commented Jul 22, 2024

Hope you don't mind me commenting here; this issue was linked in our issue tracker.

Just want to offer a small clarification on:

> I had examples with many more steps and wider tables where the difference was > 100ms. For wide tables, there will be some improvements in Ibis 9.2, see ibis-project/ibis#9111, but some overhead might just be inherent.

Ibis's expressions definitely have some overhead, but that overhead scales relative to the size of your query, not your data. We're doing some work to reduce it in certain cases (as you noted, some operations scaled poorly on wide tables), but for most analytics use cases the overhead (10-100 ms in your examples above) is negligible compared to the execution time of the query. Ibis was designed to perform well for data analytics/engineering tasks, with operations on medium-to-large data. For small data the overhead will be more noticeable. It definitely wasn't written to compete with SQLAlchemy-like webapp use cases!
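
A micro-sketch of that point: an unbound Ibis table carries no data at all, so the cost below is pure expression construction, and it grows with the number of chained operations, not with row count (names and step count here are arbitrary):

import time

import ibis

# An unbound table: a schema with no data attached.
t = ibis.table({"a": "int64", "b": "float64"}, name="t")

start = time.perf_counter()
expr = t
for i in range(50):  # overhead scales with chained operations, not data size
    expr = expr.mutate(**{f"c{i}": t.a + i})
print(f"built 50-step expression in {time.perf_counter() - start:.4f}s")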

@binste

binste commented Jul 22, 2024

Thanks @jcrist! That summarises better what I was trying to convey, and hopefully helps the Narwhals maintainers think this one through :)

@MarcoGorelli
Member Author

Thanks both for your comments! 🙏

The objective of Narwhals is to be as close as possible to a zero-cost abstraction, whilst keeping everything pure Python and without required dependencies.

At a minimum, if we were to consider supporting DuckDB by going via Ibis, we would require that this be done without any heavyweight dependencies and with negligible overhead. I understand that we're not yet in a situation where this is possible, but I was keen to get the conversation started.

Anyway, it's nice to see this issue bringing together maintainers of different projects to discuss things 🤗

@MarcoGorelli
Member Author

> dask.dataframe

Things are picking up here. It involves a fair few refactors, which is probably a good thing. If anyone's interested in contributing support for the other libraries mentioned, I'd suggest waiting until Dask support is complete enough (let's say, able to execute TPC-H queries 1-7) - by the time that's happened, adding other backends should be much simpler.
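
For reference, TPC-H Q1 boils down to a filter plus group_by/agg plus sort; here's a rough sketch of its core through the Narwhals API (lineitem column names follow the standard TPC-H schema, and the exact spelling is illustrative):

import narwhals as nw

def q1_style(lineitem_native):
    # The heart of TPC-H Q1: aggregate lineitem by return flag and line status.
    lf = nw.from_native(lineitem_native)
    return (
        lf.group_by("l_returnflag", "l_linestatus")
        .agg(
            nw.col("l_quantity").sum().alias("sum_qty"),
            nw.col("l_extendedprice").sum().alias("sum_base_price"),
            nw.col("l_discount").mean().alias("avg_disc"),
            nw.len().alias("count_order"),
        )
        .sort("l_returnflag", "l_linestatus")
        .collect()
        .to_native()
    )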

@MarcoGorelli
Member Author

MarcoGorelli commented Jan 5, 2025

This is really work-in-progress, and not at all ready for prod, but we're making progress in #1725.

Running the benchmark above:

from __future__ import annotations

import duckdb
from duckdb import ColumnExpression, ConstantExpression
import narwhals as nw


def native(penguins):
    # Baseline: the query written directly against DuckDB's relational API.
    penguins = (
        penguins.filter(
            (ColumnExpression("species") == ConstantExpression("Gentoo"))
            & (ColumnExpression("body_mass_g") > ConstantExpression(6000))
        )
        .project(
            *[ColumnExpression(name) for name in penguins.columns],
            (ColumnExpression("bill_length_mm") / ConstantExpression(10)).alias(
                "bill_length_cm"
            ),
        )
        .select(
            "species",
            "island",
            "bill_depth_mm",
            "flipper_length_mm",
            "body_mass_g",
            "sex",
            "bill_length_cm",
        )
    )
    return penguins.arrow()


def nw_soln(penguins):
    # The same query expressed through the Narwhals API on top of DuckDB.
    df = nw.from_native(penguins)
    return (
        df.filter(nw.col("species") == "Gentoo", nw.col("body_mass_g") > 6000)
        .with_columns(bill_length_cm=nw.col("bill_length_mm") / 10)
        .select(
            "species",
            "island",
            "bill_depth_mm",
            "flipper_length_mm",
            "body_mass_g",
            "sex",
            "bill_length_cm",
        )
        .collect()
        .to_native()
    )


print(native(duckdb.read_parquet("../scratch/penguins.parquet")))
print()
print(nw_soln(duckdb.read_parquet("../scratch/penguins.parquet")))

I see:

In [2]: results = %timeit -o native(duckdb.read_parquet("../scratch/penguins.parquet"))
2.23 ms ± 190 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: results.best
Out[3]: 0.0019655416079986028

In [4]: results = %timeit -o nw_soln(duckdb.read_parquet("../scratch/penguins.parquet"))
3.12 ms ± 221 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: results.best
Out[5]: 0.0026353435700002593

# ibis: around 9-10 ms, but as noted above it's an overhead that disappears for larger datasets

There's still some overhead, and I'll see how to reduce it, but I think it looks encouraging - the DuckDB relational API is a lot nicer and more complete than I was expecting. The main thing we're missing is WindowExpression, but it looks like it's on their roadmap.
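
For context, a hypothetical sketch of the kind of order-dependent operation WindowExpression would enable (assuming it translates to SUM(...) OVER (PARTITION BY ...) on the DuckDB side):

import narwhals as nw

def total_mass_per_species(lf):
    # Would require a window function in the generated DuckDB plan.
    return lf.with_columns(
        nw.col("body_mass_g").sum().over("species").alias("species_total_mass")
    )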
