Default .join behavior with categorical columns #7632

Closed
stevenlis opened this issue Mar 18, 2023 · 6 comments · Fixed by #12657
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@stevenlis

Problem description

The current implementation (0.16.14) of .join requires categorical key columns to have been created under the same global string cache.

import pandas as pd
import polars as pl

data_1 = {
    'col': ['a', 'b', 'c'],
    'col_1': [1, 2, 3]
}
data_2 = {
    'col': ['b', 'c', 'd'],
    'col_2': [1, 2, 3]
}

df_1 = pl.DataFrame(data_1, schema={'col': pl.Categorical, 'col_1': pl.Int64})
df_2 = pl.DataFrame(data_2, schema={'col': pl.Categorical, 'col_2': pl.Int64})
df_1.join(df_2, on='col', how='outer')
ComputeError: joins/or comparisons on categoricals can only happen 
if they were created under the same global string cache

In pandas, by contrast, the merged key column stays categorical if both sides have the same categories and is converted to object (string) dtype if they differ.

df_1 = pd.DataFrame(data_1, dtype='category')
df_2 = pd.DataFrame(data_2, dtype='category')
df_1.merge(df_2, on='col', how='outer').col.dtypes
dtype('O')

It feels tedious and disruptive to wrap the code in a with pl.StringCache() context manager whenever I need to join two dataframes. I wonder if this could be done on the fly:

  1. If the categories are the same (regardless of their order), polars converts them to the same global string cache.
  2. If the categories are different, polars extends the categories and converts them to the same global string cache.
  3. If polars implements an ordered categorical dtype in the future: when the categories are different, convert the joined column to string or to an unordered categorical.
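
A minimal sketch of what points 1 and 2 could look like, in plain Python rather than polars internals. The (categories, codes) layout and the unify_categories helper are assumptions made up for illustration; they are not part of the polars API.

```python
def unify_categories(cats_left, cats_right):
    """Return a merged category list plus a code remap for each side."""
    merged = list(cats_left)
    index = {cat: code for code, cat in enumerate(merged)}
    # Extend the left categories with anything only the right side has.
    for cat in cats_right:
        if cat not in index:
            index[cat] = len(merged)
            merged.append(cat)
    remap_left = [index[cat] for cat in cats_left]
    remap_right = [index[cat] for cat in cats_right]
    return merged, remap_left, remap_right


# Each column is modelled as (categories, codes), i.e. dictionary-encoded.
left_cats, left_codes = ["a", "b", "c"], [0, 1, 2]
right_cats, right_codes = ["b", "c", "d"], [0, 1, 2]

merged, remap_left, remap_right = unify_categories(left_cats, right_cats)
left_unified = [remap_left[code] for code in left_codes]
right_unified = [remap_right[code] for code in right_codes]

# The codes are now comparable, so the join keys can be matched as integers.
matches = sorted(set(left_unified) & set(right_unified))
matched_values = [merged[code] for code in matches]
print(merged, left_unified, right_unified, matched_values)
```

The remap only touches the category lists (one entry per unique value), so the extra cost scales with the number of categories, not the number of rows.
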
@stevenlis stevenlis added the enhancement New feature or an improvement of an existing feature label Mar 18, 2023
@stevenlis
Author

I also found that

If you want the global string cache to exist during the whole run, you can set toggle_string_cache to True

Since this is not the default behavior, what are the potential drawbacks of pl.toggle_string_cache(True)?

@s-banach
Contributor

s-banach commented Mar 18, 2023

(I have no real insight, this is just me rambling.)
This is how polars normally handles categorical columns:

print(pl.DataFrame([
    pl.Series("s", ["A", "B"], dtype=pl.Categorical),
    pl.Series("t", ["B", "A"], dtype=pl.Categorical)]).select(
    "s", "t",
    s_phys=pl.col("s").to_physical(),
    t_phys=pl.col("t").to_physical()))
"""
shape: (2, 4)
┌─────┬─────┬────────┬────────┐
│ s   ┆ t   ┆ s_phys ┆ t_phys │
│ --- ┆ --- ┆ ---    ┆ ---    │
│ cat ┆ cat ┆ u32    ┆ u32    │
╞═════╪═════╪════════╪════════╡
│ A   ┆ B   ┆ 0      ┆ 0      │
│ B   ┆ A   ┆ 1      ┆ 1      │
└─────┴─────┴────────┴────────┘
"""

And this is what happens if you enable a global string cache:

pl.toggle_string_cache(True)
print(pl.DataFrame([
    pl.Series("s", ["A", "B"], dtype=pl.Categorical),
    pl.Series("t", ["B", "A"], dtype=pl.Categorical)]).select(
    "s", "t",
    s_phys=pl.col("s").to_physical(),
    t_phys=pl.col("t").to_physical()))
"""
shape: (2, 4)
┌─────┬─────┬────────┬────────┐
│ s   ┆ t   ┆ s_phys ┆ t_phys │
│ --- ┆ --- ┆ ---    ┆ ---    │
│ cat ┆ cat ┆ u32    ┆ u32    │
╞═════╪═════╪════════╪════════╡
│ A   ┆ B   ┆ 0      ┆ 1      │
│ B   ┆ A   ┆ 1      ┆ 0      │
└─────┴─────┴────────┴────────┘
"""

Presumably it's disabled by default for performance reasons, so different categorical columns don't need to "waste time" synchronizing their physical representations.
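
To make the difference between the two tables concrete, here is a toy model in plain Python. It is only an illustration of the idea; encode_local and GlobalCache are hypothetical names, and the real polars implementation is certainly more involved.

```python
def encode_local(values):
    """Encode with a fresh per-column cache: codes follow first appearance."""
    cache = {}
    return [cache.setdefault(value, len(cache)) for value in values]


class GlobalCache:
    """Toy shared cache: every column encoded through it reuses the same codes."""

    def __init__(self):
        self._codes = {}

    def encode(self, values):
        return [self._codes.setdefault(value, len(self._codes)) for value in values]


# Without a shared cache, each column starts counting from 0 on its own.
s_local = encode_local(["A", "B"])
t_local = encode_local(["B", "A"])

# With a shared cache, "A" and "B" keep the codes they got the first time.
shared = GlobalCache()
s_global = shared.encode(["A", "B"])
t_global = shared.encode(["B", "A"])

print(s_local, t_local)    # [0, 1] [0, 1]
print(s_global, t_global)  # [0, 1] [1, 0]
```

In the local case the code 0 means "A" in one column and "B" in the other, which is why the physical values cannot be compared directly in a join.
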

I do agree it's very irksome that the mapping to a common physical representation isn't performed automatically when you join df_1 and df_2. The whole point of categorical columns is that they should have low cardinality (a small number of unique values), so the operation shouldn't take very long.

@stevenlis
Author

@s-banach Thanks for the insights. My understanding is that a separate StringCache is created whenever you create/convert a categorical dtype. Thus, even though the col columns in the following dataframes share the same categories, you still get an error and have to manually create another StringCache in a context manager for them.

data1 = {
    'col': ['a', 'b', 'c'],
    'col_1': [1, 2, 3]
}
data2 = {
    'col': ['a', 'b', 'c'],
    'col_2': [1, 2, 3]
}

df_1 = pl.DataFrame(data1, schema={'col': pl.Categorical, 'col_1': pl.Int64})
df_2 = pl.DataFrame(data2, schema={'col': pl.Categorical, 'col_2': pl.Int64})

df_1.join(df_2, on='col', how='outer')

If this is indeed the case, I wonder if it would be a good idea to define the StringCache up front, like a CategoricalDtype in pandas. Something like this (proposed syntax, not an existing polars API):

cat = pl.StringCache(['a', 'b', 'c'])
df_1 = pl.DataFrame(data1, schema={'col': cat, 'col_1': pl.Int64})
df_2 = pl.DataFrame(data2, schema={'col': cat, 'col_2': pl.Int64})
# in pandas:
cat = pd.CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)
df_1.col = df_1.col.astype(cat)
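
The idea of a category set that is declared once and shared between columns can be sketched in plain Python. FixedCategories is a hypothetical stand-in written for this illustration; it is not a polars or pandas class.

```python
class FixedCategories:
    """Hypothetical predeclared category set shared by several columns."""

    def __init__(self, categories):
        self.categories = tuple(categories)
        self._codes = {cat: code for code, cat in enumerate(self.categories)}

    def encode(self, values):
        # Values outside the declared set are rejected instead of being added.
        try:
            return [self._codes[value] for value in values]
        except KeyError as exc:
            raise ValueError(f"unknown category: {exc.args[0]!r}") from None


cat = FixedCategories(["a", "b", "c"])
col_1 = cat.encode(["a", "b", "c"])
col_2 = cat.encode(["c", "a", "b"])

# Both columns use the same mapping, so their codes are directly comparable.
print(col_1, col_2)
```

Because the mapping is fixed at declaration time, no synchronization is needed when two such columns meet in a join.
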

@c-peters
Collaborator

There are some nuances when it comes to joining on categoricals. I am hoping to explain those in more detail in the user guide in #11822. Feedback on that PR would be greatly appreciated.

@s-banach
Contributor

Super cool work!
My main concern is handling arbitrary pandas dataframes from end users who don’t know polars yet.

Will pandas categorical dtypes be converted to Enum now?
Is there any plan to allow non-string Enum types?

@s-banach
Contributor

The prose of the new categorical user guide might need to be slightly polished, but it’s hard to type suggestions now on my phone.

@c-peters c-peters added the accepted Ready for implementation label Dec 1, 2023
@c-peters c-peters self-assigned this Dec 1, 2023
3 participants