feat: Add new Enum categorical data type which allows a fixed set of categories #11822

Merged: 22 commits, Dec 1, 2023

Collaborator

@c-peters c-peters commented Oct 18, 2023

Relates to #10705

This PR is the first in a series of PRs to improve categoricals in Polars.
It allows users to provide a fixed list of categories when initializing or casting to a categorical.

import polars as pl

s = pl.Series('a', ['a', 'b', 'c'], dtype=pl.Enum(['a', 'b', 'c']))

df = pl.LazyFrame({'a': ['a', 'b', 'c']}).with_columns(pl.col('a').cast(pl.Enum(['a', 'b', 'c']))).collect()

Specifying a value outside of the provided list raises an error:

s = pl.Series('a', ['a','b','c'], dtype=pl.Enum(['a','b']))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chiel/Documents/polars/polars/py-polars/polars/series/series.py", line 283, in __init__
    self._s = sequence_to_pyseries(
  File "/home/chiel/Documents/polars/polars/py-polars/polars/utils/_construction.py", line 423, in sequence_to_pyseries
    pyseries = pyseries.cast(dtype, strict=True)
exceptions.OutOfBoundsError: Value c in string column not found in fixed set of categories LargeUtf8Array[a, b]

Note that Enum is an addition to the existing Categorical type, not a replacement. The distinction is that Categorical types are flexible and new categories are added on the fly, whereas the categories of an Enum are fixed and do not change.
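
As a rough illustration of that distinction, here is a toy model in plain Python. It is not how Polars implements either type; the class names `FlexibleCategorical` and `FixedEnum` and the use of `ValueError` are made up for this sketch. It only shows the behavioral difference: one encoder grows its category list as new values arrive, the other rejects anything outside the list it was given.

```python
class FlexibleCategorical:
    """Toy categorical encoder: unseen values become new categories."""

    def __init__(self):
        self.categories = []  # category strings in first-seen order
        self._index = {}      # category string -> integer code

    def encode(self, values):
        codes = []
        for value in values:
            if value not in self._index:
                # New category: append it and assign the next free code.
                self._index[value] = len(self.categories)
                self.categories.append(value)
            codes.append(self._index[value])
        return codes


class FixedEnum:
    """Toy enum encoder: the category list is fixed at construction time."""

    def __init__(self, categories):
        self.categories = list(categories)
        self._index = {cat: code for code, cat in enumerate(self.categories)}

    def encode(self, values):
        codes = []
        for value in values:
            if value not in self._index:
                # Hypothetical error type; the real library raises its own exception.
                raise ValueError(
                    f"value {value!r} not found in fixed set of categories {self.categories}"
                )
            codes.append(self._index[value])
        return codes


# The flexible encoder accepts anything and grows its categories.
flexible = FlexibleCategorical()
flexible_codes = flexible.encode(["a", "b", "c"])

# The fixed encoder only accepts values from the list it was given.
fixed = FixedEnum(["a", "b"])
fixed_codes = fixed.encode(["a", "b", "a"])

try:
    fixed.encode(["a", "b", "c"])
    enum_error = None
except ValueError as exc:
    enum_error = str(exc)
```

In this sketch, encoding `"c"` with the fixed encoder fails in the same spirit as the traceback above, while the flexible encoder simply registers `"c"` as a third category.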

Todo in future PRs

@stinodego
Contributor

stinodego commented Oct 18, 2023

@c-peters looks like you have to rebase on main to get the lints to pass! looks like you already did 😅

@stinodego stinodego changed the title from "Allow fixed set of categories in Dtype Categorical" to "feat: Allow fixed set of categories in Dtype Categorical" Oct 18, 2023
@github-actions github-actions bot added the enhancement, python, and rust labels Oct 18, 2023
@ritchie46
Member

Hope to get to this one end of today.

Contributor

@stinodego stinodego left a comment

I know this is a WIP but I spotted some minor things on the Python side - figured I might as well leave a comment.

py-polars/polars/datatypes/classes.py (outdated, resolved)
py-polars/polars/datatypes/classes.py (outdated, resolved)
py-polars/polars/utils/_construction.py (outdated, resolved)
@c-peters
Collaborator Author

Waiting on #12091 to fix failing tests that are caused by another bug.

@c-peters c-peters marked this pull request as draft November 21, 2023 10:11
@c-peters c-peters requested a review from ritchie46 November 22, 2023 15:54
@c-peters c-peters marked this pull request as ready for review November 22, 2023 15:54
Contributor

@stinodego stinodego left a comment

This looks great! It's going to make a lot of users very happy.

I left a whole bunch of nitpick comments but overall this looks solid, at least on the Python side (I'll leave the Rust side to Ritchie for now).

We still have some way to go toward better integration of this data type, such as making sure it works with the interchange protocol's __dataframe__ method and probably some others, but we can pick those up as we go along.

mkdocs.yml (outdated, resolved)
py-polars/polars/datatypes/classes.py (outdated, resolved)
py-polars/polars/datatypes/classes.py (outdated, resolved)
py-polars/polars/datatypes/classes.py (outdated, resolved)
py-polars/polars/datatypes/classes.py (outdated, resolved)
docs/user-guide/concepts/data-types/categoricals.md (outdated, resolved)
docs/user-guide/concepts/data-types/categoricals.md (outdated, resolved)
docs/user-guide/concepts/data-types/categoricals.md (outdated, resolved)
docs/user-guide/concepts/data-types/categoricals.md (outdated, resolved)
@ritchie46
Member

ritchie46 commented Dec 1, 2023

Can you rebase? Then we can get this in!

@stinodego
Contributor

stinodego commented Dec 1, 2023

Lint failure is my fault (dependency updates) - fixing as we speak. Another rebase is probably needed.

EDIT: Rebased.

@ritchie46 ritchie46 merged commit 93e37d4 into pola-rs:main Dec 1, 2023
29 checks passed
@stinodego stinodego added the highlight label Dec 1, 2023
@c-peters c-peters added the accepted label Dec 1, 2023
@c-peters c-peters self-assigned this Dec 1, 2023
@stinodego stinodego changed the title from "feat: Allow fixed set of categories in Dtype Categorical" to "feat: Add new Enum categorical data type which allows a fixed set of categories" Dec 11, 2023
@c-peters c-peters deleted the categorical_cats branch December 28, 2023 07:52