Feature request: Itertools helpers #318

Open
jesusestevez opened this issue Jan 27, 2025 · 4 comments
@jesusestevez

Looking for a way to implement a combination of two lists, I noticed these two questions on Stack Overflow:

https://stackoverflow.com/questions/77354114/how-would-i-generate-combinations-of-items-within-polars-using-the-native-expres

https://stackoverflow.com/questions/79340441/python-polars-expression-list-product

and this enhancement request: pola-rs/polars#11999

I think this would be a great addition to the set of data science support tools provided by this extension, allowing users to apply a cross_join or generate combinations of lists without the hassle of resorting to map_elements.

@abstractqqq
Owner

abstractqqq commented Jan 28, 2025

Sure, I can give it a try, but no guarantees.

The real issue in the Stack Overflow post is that the user is running out of memory. This operation is very expensive. We can get rid of the cross join, which might save a bit of memory, but it will still be expensive because n choose k can get quite large.

I also don't think plugins can support streaming (not yet).
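To make the "n choose k can get quite large" point concrete, here is a small standard-library sketch that counts the combinations and gives a very rough size estimate. The helper names and the 8-bytes-per-value figure are assumptions for illustration only; real memory use depends on the data type and on list overhead.

```python
import math


def combination_count(n: int, k: int) -> int:
    """Number of k-combinations of n distinct values (n choose k)."""
    return math.comb(n, k)


def rough_output_bytes(n: int, k: int, bytes_per_value: int = 8) -> int:
    # Assumption: each combination stores k values of 8 bytes each;
    # list offsets and any other overhead are ignored.
    return combination_count(n, k) * k * bytes_per_value


if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000):
        count = combination_count(n, 3)
        gigabytes = rough_output_bytes(n, 3) / 1e9
        print(f"n={n:>6}  C(n, 3)={count:>15,}  ~{gigabytes:.3f} GB")
```

With k = 3 the count grows roughly like n^3 / 6, so going from 1,000 to 10,000 input values multiplies the output by about a thousand, whether or not a cross join is involved.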

@abstractqqq
Owner

I have a PR here: #322. It adds the function below, which you can call via pds.combinations(...).

def combinations(source: str | pl.Expr, k: int, unique: bool = False) -> pl.Expr:
    """
    Get all k-combinations of non-null values in source. This is an expensive operation, as
    n choose k can grow very fast.

    Parameters
    ----------
    source
        Input source column, must have numeric or string type
    k
        The k in N choose k
    unique
        Whether to run .unique() on the source column

    Examples
    --------
    >>> df = pl.DataFrame({
    ...     "category": ["a", "a", "a", "b", "b"],
    ...     "values": [1, 2, 3, 4, 5]
    ... })
    >>> df.select(
    ...     pds.combinations("values", 3)
    ... )
    shape: (10, 1)
    ┌───────────┐
    │ values    │
    │ ---       │
    │ list[i64] │
    ╞═══════════╡
    │ [1, 2, 3] │
    │ [1, 2, 4] │
    │ [1, 2, 5] │
    │ [1, 3, 4] │
    │ [1, 3, 5] │
    │ [1, 4, 5] │
    │ [2, 3, 4] │
    │ [2, 3, 5] │
    │ [2, 4, 5] │
    │ [3, 4, 5] │
    └───────────┘
    >>> df.group_by("category").agg(
    ...     pds.combinations("values", 2)
    ... )
    shape: (2, 2)
    ┌──────────┬──────────────────────────┐
    │ category ┆ values                   │
    │ ---      ┆ ---                      │
    │ str      ┆ list[list[i64]]          │
    ╞══════════╪══════════════════════════╡
    │ a        ┆ [[1, 2], [1, 3], [2, 3]] │
    │ b        ┆ [[4, 5]]                 │
    └──────────┴──────────────────────────┘
    """
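
For readers who cannot build the branch, the semantics described in the docstring can be approximated in plain Python with itertools. This is only a reference sketch of the documented behavior (drop nulls, optionally deduplicate, then take k-combinations), not the code in the PR. The name `reference_combinations` is hypothetical, and the order-preserving deduplication is an assumption, since `.unique()` may not keep the original order.

```python
from itertools import combinations as iter_combinations


def reference_combinations(values, k, unique=False):
    """Pure-Python approximation of the documented behavior."""
    # Keep only non-null values, as the docstring describes.
    non_null = [v for v in values if v is not None]
    if unique:
        # Assumption: deduplicate while preserving first-seen order.
        non_null = list(dict.fromkeys(non_null))
    return [list(combo) for combo in iter_combinations(non_null, k)]


if __name__ == "__main__":
    # Mirrors the ungrouped docstring example: 5 choose 3 gives 10 rows.
    print(reference_combinations([1, 2, 3, 4, 5], 3))

    # Mirrors the group_by example, one list of pairs per category.
    groups = {"a": [1, 2, 3], "b": [4, 5]}
    print({key: reference_combinations(vals, 2) for key, vals in groups.items()})
```

Like the plugin, this materializes every combination, so it has the same n choose k cost and is only practical for small inputs.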

@abstractqqq abstractqqq self-assigned this Feb 4, 2025
@jesusestevez
Author

This is amazing! However, it seems I cannot install from the branch, since that requires Rust and I have some limitations on installing it on my machine.

Out of curiosity, is an itertools.product equivalent expected in the near future too?

@abstractqqq
Owner

> This is amazing! However, it seems I cannot install from the branch, since that requires Rust and I have some limitations on installing it on my machine.
>
> Out of curiosity, is an itertools.product equivalent expected in the near future too?

It is plausible. These are very similar operations, so it really depends on how much demand there is.

I expect to publish v0.8.1, which will include the combinations function, in the middle of the month.
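
To show how close the two operations are, here is a plain-Python sketch of what a product helper would compute for two lists, next to the combinations case. It uses only itertools; `reference_product` is a hypothetical name for illustration and is not an existing pds function.

```python
from itertools import combinations, product


def reference_product(left, right):
    """Cartesian product of two lists, one [l, r] pair per output row."""
    return [list(pair) for pair in product(left, right)]


if __name__ == "__main__":
    # Product pairs every element of the first list with every element of the second.
    print(reference_product([1, 2], ["a", "b"]))

    # Combinations pick unordered pairs from a single list.
    print([list(c) for c in combinations([1, 2, 3], 2)])
```

The product of lists with m and n elements has m * n rows, so it carries the same memory concern as the combinations case.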
