
feat: add "lazy-only" level of support #566

Open
MarcoGorelli opened this issue Jul 21, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@MarcoGorelli
Member

MarcoGorelli commented Jul 21, 2024

If we look at:

  • dask.dataframe
  • ibis (personally I'd prefer to wait for them to complete the removal of pandas+pyarrow as required dependencies for all backends)
  • duckdb (either via Ibis, if feat: to_duckdb ibis-project/ibis#9629 is addressed, or directly via their relational API)

then it doesn't seem too hard to support the narwhals.LazyFrame API without running into footguns. We may need to disallow operations which operate on multiple columns' row orders independently (the alternative is doing a join on row order behind the scenes, which I'm not too keen on, as we'd be making an expensive operation look cheap).
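
To make that concrete, here's a hypothetical illustration (in Polars syntax, which does permit this) of the kind of operation that would be disallowed:

import polars as pl

# "a" is reordered on its own while "b" keeps its original order. On an
# unordered lazy backend, supporting this would need a hidden join on row
# position - an expensive operation made to look cheap.
lf = pl.LazyFrame({"a": [3, 1, 2], "b": [10, 20, 30]})
lf.select(pl.col("a").sort(), pl.col("b")).collect()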

Each one can then have an associated eager class, which is what things get transformed to if you call .collect (see the sketch after the list):

  • dask goes to pandas
  • ibis goes to pyarrow
  • duckdb goes to pyarrow
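
As a sketch of what this could look like from the user's side (hypothetical: it assumes Dask support lands as described):

import dask.dataframe as dd
import narwhals as nw

# A lazy, Dask-backed frame; .collect() materialises it, handing back a
# pandas-backed eager frame per the mapping above.
lf = nw.from_native(dd.from_dict({"a": [1, 2, 3]}, npartitions=1))
df = lf.select(nw.col("a") * 2).collect()
print(df.to_native())  # a pandas DataFrame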

We can start with this, and possibly consider expanding support later. As they say - "in open source, 'no' is temporary, but 'yes' is forever" - doubly so in a library like Narwhals with a stable API policy 😄

Furthermore, we'd like to make it easier to extend Narwhals, so that if library authors disagree with us and want to add complexity which we wouldn't be happy maintaining, they're free to extend Narwhals themselves - all they'd need to do is implement a few dunder methods.
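
A rough sketch of that extension mechanism (the dunder name below is an assumption, shown only to illustrate the shape of the protocol):

class MyLazyFrame:
    """A third-party lazy frame opting in to Narwhals support."""

    def __narwhals_lazyframe__(self):
        # Assumed hook: return an object implementing the compliant-level
        # LazyFrame protocol that Narwhals' translation layer expects.
        return self

# nw.from_native(MyLazyFrame()) would then wrap it like any built-in backend.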

@MarcoGorelli added the enhancement (New feature or request) label Jul 21, 2024
@binste

binste commented Jul 22, 2024

I'm a fan of Ibis and use it regularly, mainly together with https://github.com/binste/dbt-ibis, and I find it exciting to see that you're looking at it as a potential layer around DuckDB.

However, given that one of the benefits of Narwhals is low overhead, I just wanted to note that all the type validation and abstractions in Ibis do add overhead which can add up. It might not be noticeable when working interactively with larger amounts of data, but I had use cases in web applications where I had to rewrite code back from Ibis to pure DuckDB or pandas because of this.

Here's a quick example, loosely based on this DuckDB-Ibis example. Ibis 9.1, DuckDB 1.0.

import ibis

# Create some example database
con = ibis.connect("duckdb://penguins.ddb")
con.create_table(
    "penguins", ibis.examples.penguins.fetch().to_pyarrow(), overwrite = True
)

%%timeit
# reconnect to the persisted database (dropping temp tables)
con = ibis.connect("duckdb://penguins.ddb")
penguins = con.table("penguins")
penguins = (
    penguins.filter((penguins.species == "Gentoo") & (penguins.body_mass_g > 6000))
    .mutate(bill_length_cm=penguins.bill_length_mm / 10)
    .select(
        "species",
        "island",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
        "sex",
        "year",
        "bill_length_cm",
    )
)
penguins_arrow = penguins.to_pyarrow()

19.1 ms ± 482 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Same in pure DuckDB (assumes import duckdb and from duckdb import ColumnExpression, ConstantExpression have been run beforehand):

%%timeit
con = duckdb.connect("penguins.ddb")
penguins = con.table("penguins")
penguins = (
    penguins.filter(
        (ColumnExpression("species") == ConstantExpression("Gentoo"))
        & (ColumnExpression("body_mass_g") > ConstantExpression(6000))
    )
    .project(
        *[ColumnExpression(name) for name in penguins.columns],
        (ColumnExpression("bill_length_mm") / ConstantExpression(10)).alias(
            "bill_length_cm"
        )
    )
    .select(
        "species",
        "island",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
        "sex",
        "year",
        "bill_length_cm",
    )
)
penguins_arrow = penguins.arrow()

1.5 ms ± 9.11 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I had examples with many more steps and wider tables where the difference was > 100ms. For wide tables, there will be some improvements in Ibis 9.2, see ibis-project/ibis#9111, but some overhead might just be inherent.

Just thought you might appreciate the note that it's worth benchmarking what you want to achieve first, in case it wasn't already on your list anyway :)

@jcrist

jcrist commented Jul 22, 2024

Hope you don't mind me commenting here; this issue was linked in our issue tracker.

Just want to offer a small clarification on:

> I had examples with many more steps and wider tables where the difference was > 100ms. For wide tables, there will be some improvements in Ibis 9.2, see ibis-project/ibis#9111, but some overhead might just be inherent.

Ibis's expressions definitely have some overhead, but that overhead scales relative to the size of your query, not your data. We're doing some work to reduce it in certain cases (as you noted, some operations scaled poorly on wide tables), but for most analytics use cases the overhead (10-100 ms in your examples above) is negligible compared to the execution time of the query. Ibis was designed to perform well for data analytics/engineering tasks, with operations on medium-to-large data. For small data the overhead will be more noticeable. It definitely wasn't written to compete with SQLAlchemy-like webapp use cases!
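
A micro-sketch of that point: an unbound Ibis table carries no data at all, so the cost below is pure expression construction, and it grows with the number of chained operations, not with row count (names and step count here are arbitrary):

import time

import ibis

# An unbound table: a schema with no data attached.
t = ibis.table({"a": "int64", "b": "float64"}, name="t")

start = time.perf_counter()
expr = t
for i in range(50):  # overhead scales with chained operations, not data size
    expr = expr.mutate(**{f"c{i}": t.a + i})
print(f"built 50-step expression in {time.perf_counter() - start:.4f}s")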

@binste

binste commented Jul 22, 2024

Thanks @jcrist! That summarises better what I was trying to convey, and hopefully helps the Narwhals maintainers think this one through :)

@MarcoGorelli
Member Author

Thanks both for your comments! 🙏

The objective of Narwhals is to be as close as possible to a zero-cost abstraction, whilst keeping everything pure Python and without required dependencies.

At a minimum, if we were to consider supporting DuckDB by going via Ibis, we would require that this be done without any heavyweight dependencies and with negligible overhead. I understand that we're not yet in a situation where this is possible, but I was keen to get the conversation started.

Anyway, it's nice to see this issue bringing together maintainers of different projects to discuss things 🤗

@MarcoGorelli
Member Author

> dask.dataframe

Things are picking up here. It involves a fair few refactors, which is probably a good thing. If anyone's interested in contributing support for the other libraries mentioned, I'd suggest waiting until Dask support is complete enough (let's say, able to execute TPC-H queries 1-7) - by the time that's happened, adding other backends should be much simpler.
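
For reference, TPC-H Q1 boils down to a filter plus group_by/agg plus sort; here's a rough sketch of its core through the Narwhals API (lineitem column names follow the standard TPC-H schema, and the exact spelling is illustrative):

import narwhals as nw

def q1_style(lineitem_native):
    # The heart of TPC-H Q1: aggregate lineitem by return flag and line status.
    lf = nw.from_native(lineitem_native)
    return (
        lf.group_by("l_returnflag", "l_linestatus")
        .agg(
            nw.col("l_quantity").sum().alias("sum_qty"),
            nw.col("l_extendedprice").sum().alias("sum_base_price"),
            nw.col("l_discount").mean().alias("avg_disc"),
            nw.len().alias("count_order"),
        )
        .sort("l_returnflag", "l_linestatus")
        .collect()
        .to_native()
    )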

@MarcoGorelli
Member Author

MarcoGorelli commented Jan 5, 2025

This is really work-in-progress, and not at all ready for prod, but we're making progress in #1725.

Running the benchmark above:

from __future__ import annotations

import duckdb
from duckdb import ColumnExpression, ConstantExpression
import narwhals as nw


def native(penguins):
    # Baseline: the query written directly against DuckDB's relational API.
    penguins = (
        penguins.filter(
            (ColumnExpression("species") == ConstantExpression("Gentoo"))
            & (ColumnExpression("body_mass_g") > ConstantExpression(6000))
        )
        .project(
            *[ColumnExpression(name) for name in penguins.columns],
            (ColumnExpression("bill_length_mm") / ConstantExpression(10)).alias(
                "bill_length_cm"
            ),
        )
        .select(
            "species",
            "island",
            "bill_depth_mm",
            "flipper_length_mm",
            "body_mass_g",
            "sex",
            "bill_length_cm",
        )
    )
    return penguins.arrow()


def nw_soln(penguins):
    # The same query expressed through the Narwhals API on top of DuckDB.
    df = nw.from_native(penguins)
    return (
        df.filter(nw.col("species") == "Gentoo", nw.col("body_mass_g") > 6000)
        .with_columns(bill_length_cm=nw.col("bill_length_mm") / 10)
        .select(
            "species",
            "island",
            "bill_depth_mm",
            "flipper_length_mm",
            "body_mass_g",
            "sex",
            "bill_length_cm",
        )
        .collect()
        .to_native()
    )


print(native(duckdb.read_parquet("../scratch/penguins.parquet")))
print()
print(nw_soln(duckdb.read_parquet("../scratch/penguins.parquet")))

I see:

In [2]: results = %timeit -o native(duckdb.read_parquet("../scratch/penguins.parquet"))
2.23 ms ± 190 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: results.best
Out[3]: 0.0019655416079986028

In [4]: results = %timeit -o nw_soln(duckdb.read_parquet("../scratch/penguins.parquet"))
3.12 ms ± 221 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: results.best
Out[5]: 0.0026353435700002593

# ibis: around 9-10 ms, but as noted above it's an overhead that disappears for larger datasets

There's still some overhead, and I'll see how to reduce it, but I think it looks encouraging - the DuckDB relational API is a lot nicer and more complete than I was expecting. The main thing we're missing is WindowExpression, but it looks like it's on their roadmap.
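
For context, a hypothetical sketch of the kind of order-dependent operation WindowExpression would enable (assuming it translates to SUM(...) OVER (PARTITION BY ...) on the DuckDB side):

import narwhals as nw

def total_mass_per_species(lf):
    # Would require a window function in the generated DuckDB plan.
    return lf.with_columns(
        nw.col("body_mass_g").sum().over("species").alias("species_total_mass")
    )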
