Default .join behavior with categorical columns #7632

stevenlis · 2023-03-18T21:38:53Z

Problem description

The current implementation (0.16.14) of .join requires categorical key columns to have been created under the same global string cache.

import pandas as pd
import polars as pl

data_1 = {
    'col': ['a', 'b', 'c'],
    'col_1': [1, 2, 3]
}
data_2 = {
    'col': ['b', 'c', 'd'],
    'col_2': [1, 2, 3]
}

df_1 = pl.DataFrame(data_1, schema={'col': pl.Categorical, 'col_1': pl.Int64})
df_2 = pl.DataFrame(data_2, schema={'col': pl.Categorical, 'col_2': pl.Int64})
df_1.join(df_2, on='col', how='outer')

ComputeError: joins/or comparisons on categoricals can only happen 
if they were created under the same global string cache

In pandas, by contrast, the joined column stays Categorical if the categories are the same and is converted to string (object dtype) if they differ.

df_1 = pd.DataFrame(data_1, dtype='category')
df_2 = pd.DataFrame(data_2, dtype='category')
df_1.merge(df_2, on='col', how='outer').col.dtypes

dtype('O')

It seems a bit tedious and disruptive to me that I have to wrap my code in a with pl.StringCache() context manager whenever I join two dataframes. I wonder if this could be done on the fly:

If the categories (regardless of their order) are the same, polars converts them to the same global string cache.
If the categories are different, polars extends the categories and converts them to the same global string cache.
If polars plans to implement an ordered categorical dtype in the future: when the categories are different, convert the joined columns to string or unordered categories.

stevenlis · 2023-03-18T22:21:44Z

I also found that

If you want the global string cache to exist during the whole run, you can set toggle_string_cache to True

Since this is not the default behavior, what are the potential drawbacks of pl.toggle_string_cache(True)?

s-banach · 2023-03-18T22:38:12Z

(I have no real insight, this is just me rambling.)
This is how polars normally handles categorical columns:

print(pl.DataFrame([
    pl.Series("s", ["A", "B"], dtype=pl.Categorical),
    pl.Series("t", ["B", "A"], dtype=pl.Categorical)]).select(
    "s", "t",
    s_phys=pl.col("s").to_physical(),
    t_phys=pl.col("t").to_physical()))
"""
shape: (2, 4)
┌─────┬─────┬────────┬────────┐
│ s   ┆ t   ┆ s_phys ┆ t_phys │
│ --- ┆ --- ┆ ---    ┆ ---    │
│ cat ┆ cat ┆ u32    ┆ u32    │
╞═════╪═════╪════════╪════════╡
│ A   ┆ B   ┆ 0      ┆ 0      │
│ B   ┆ A   ┆ 1      ┆ 1      │
└─────┴─────┴────────┴────────┘
"""

And this is what happens if you enable a global string cache:

pl.toggle_string_cache(True)
print(pl.DataFrame([
    pl.Series("s", ["A", "B"], dtype=pl.Categorical),
    pl.Series("t", ["B", "A"], dtype=pl.Categorical)]).select(
    "s", "t",
    s_phys=pl.col("s").to_physical(),
    t_phys=pl.col("t").to_physical()))
"""
shape: (2, 4)
┌─────┬─────┬────────┬────────┐
│ s   ┆ t   ┆ s_phys ┆ t_phys │
│ --- ┆ --- ┆ ---    ┆ ---    │
│ cat ┆ cat ┆ u32    ┆ u32    │
╞═════╪═════╪════════╪════════╡
│ A   ┆ B   ┆ 0      ┆ 1      │
│ B   ┆ A   ┆ 1      ┆ 0      │
└─────┴─────┴────────┴────────┘
"""

Presumably it's disabled by default for performance reasons, so different categorical columns don't need to "waste time" synchronizing their physical representations.

I do agree it's very irksome that the mapping to a common physical representation isn't performed automatically when you join df_1 and df_2. The whole point of categorical columns is that they should have low cardinality (a small number of unique values), so the operation shouldn't take very long.
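
To make the two tables above concrete, here is a toy model in plain Python. It is only an illustration of the idea, not how polars implements categoricals; the helper encode and all variable names are hypothetical.

```python
def encode(values, mapping):
    """Map each string to an integer code, adding unseen strings to `mapping`."""
    return [mapping.setdefault(v, len(mapping)) for v in values]

s = ["A", "B"]
t = ["B", "A"]

# Without a shared cache: each column gets its own private mapping,
# so the same code can stand for different strings in different columns.
s_local = encode(s, {})
t_local = encode(t, {})

# With a shared cache: both columns use one mapping,
# so equal strings get equal codes across columns.
shared = {}
s_shared = encode(s, shared)
t_shared = encode(t, shared)

print(s_local, t_local)    # codes are not comparable across columns
print(s_shared, t_shared)  # codes are comparable across columns
```

In this model a join on the local codes would match "A" in s with "B" in t, which is presumably why polars refuses to do it, while remapping to a shared mapping only costs one pass over the (small) set of categories.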

stevenlis · 2023-03-18T23:06:32Z

@s-banach Thanks for the insights. My understanding is that a separate StringCache is created whenever you create/convert a categorical dtype. Thus, even though the col columns in the following dataframes share the same categories, you still get an error and have to manually create another StringCache in a context manager for them.

data1 = {
    'col': ['a', 'b', 'c'],
    'col_1': [1, 2, 3]
}
data2 = {
    'col': ['a', 'b', 'c'],
    'col_2': [1, 2, 3]
}

df_1 = pl.DataFrame(data1, schema={'col': pl.Categorical, 'col_1': pl.Int64})
df_2 = pl.DataFrame(data2, schema={'col': pl.Categorical, 'col_2': pl.Int64})

df_1.join(df_2, on='col', how='outer')

If this is indeed the case, I wonder if it would be a good idea to just set up a StringCache in the first place like a CategoricalDtype in pandas, so

cat = pl.StringCache(['a', 'b', 'c'])
df_1 = pl.DataFrame(data1, schema={'col': cat, 'col_1': pl.Int64})
df_2 = pl.DataFrame(data2, schema={'col': cat, 'col_2': pl.Int64})

# in pandas:
cat = pd.CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)
df_1.col = df_1.col.astype(cat)
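
Spelled out in full, the pandas side of that comparison would look something like this. It is a sketch using separate pandas frames (named pdf_1 and pdf_2 here to avoid clashing with the polars frames above); because both key columns use the exact same CategoricalDtype, the merged key should stay categorical.

```python
import pandas as pd

data1 = {'col': ['a', 'b', 'c'], 'col_1': [1, 2, 3]}
data2 = {'col': ['a', 'b', 'c'], 'col_2': [1, 2, 3]}

# One dtype object defined up front and shared by both frames.
cat = pd.CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

pdf_1 = pd.DataFrame(data1).astype({'col': cat})
pdf_2 = pd.DataFrame(data2).astype({'col': cat})

# Both key columns carry the identical dtype, so no extra setup is needed to merge.
merged = pdf_1.merge(pdf_2, on='col', how='outer')
print(merged['col'].dtype)
```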

c-peters · 2023-11-24T09:45:01Z

There are some nuances when it comes to joining on categoricals. I am hoping to explain those in more detail in the user guide in #11822. Feedback on this one would be greatly appreciated.

s-banach · 2023-11-24T15:45:32Z

Super cool work!
My main concern is handling arbitrary pandas dataframes from end users who don’t know polars yet.

Will pandas categorical dtypes be converted to Enum now?
Is there any plan to allow non-string Enum types?

s-banach · 2023-11-24T15:52:03Z

The prose of the new categorical user guide might need to be slightly polished, but it’s hard to type suggestions now on my phone.

stevenlis added the enhancement New feature or an improvement of an existing feature label Mar 18, 2023

stevenlis mentioned this issue Jul 31, 2023

Support user-defined order in Expr.cat.set_ordering #8678

Closed

avimallu mentioned this issue Jul 6, 2023

Add section on categorical data to the user guide #11110

Closed

c-peters mentioned this issue Nov 24, 2023

feat: Join operations on local categoricals #12657

Merged

ritchie46 closed this as completed in #12657 Nov 24, 2023

c-peters added the accepted Ready for implementation label Dec 1, 2023

c-peters self-assigned this Dec 1, 2023
