Default .join behavior with categorical columns #7632
Comments
I also found that
Since this is not the default behavior, what are the potential drawbacks with |
(I have no real insight, this is just me rambling.)
print(pl.DataFrame([
pl.Series("s", ["A", "B"], dtype=pl.Categorical),
pl.Series("t", ["B", "A"], dtype=pl.Categorical)]).select(
"s", "t",
s_phys=pl.col("s").to_physical(),
t_phys=pl.col("t").to_physical()))
"""
shape: (2, 4)
┌─────┬─────┬────────┬────────┐
│ s ┆ t ┆ s_phys ┆ t_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═════╪═════╪════════╪════════╡
│ A ┆ B ┆ 0 ┆ 0 │
│ B ┆ A ┆ 1 ┆ 1 │
└─────┴─────┴────────┴────────┘
"""
And this is what happens if you enable a global string cache:
pl.toggle_string_cache(True)
print(pl.DataFrame([
pl.Series("s", ["A", "B"], dtype=pl.Categorical),
pl.Series("t", ["B", "A"], dtype=pl.Categorical)]).select(
"s", "t",
s_phys=pl.col("s").to_physical(),
t_phys=pl.col("t").to_physical()))
"""
shape: (2, 4)
┌─────┬─────┬────────┬────────┐
│ s ┆ t ┆ s_phys ┆ t_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═════╪═════╪════════╪════════╡
│ A ┆ B ┆ 0 ┆ 1 │
│ B ┆ A ┆ 1 ┆ 0 │
└─────┴─────┴────────┴────────┘
"""
Presumably it's disabled by default for performance reasons, so that different categorical columns don't need to "waste time" synchronizing their physical representations. I do agree it's very irksome that the mapping to a common physical representation isn't performed automatically when you join |
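The difference between the two outputs above can be mimicked in plain Python. This is only a conceptual sketch, not how Polars implements it: `encode_local` and `SharedCache` are hypothetical names, and the assumption is that codes are assigned in order of first appearance.

```python
def encode_local(values):
    """Encode one column with its own private mapping (no shared cache)."""
    mapping = {}
    # setdefault assigns the next free code the first time a string is seen
    return [mapping.setdefault(v, len(mapping)) for v in values]


class SharedCache:
    """Hypothetical stand-in for a global string cache: one mapping for all columns."""

    def __init__(self):
        self.mapping = {}

    def encode(self, values):
        return [self.mapping.setdefault(v, len(self.mapping)) for v in values]


s, t = ["A", "B"], ["B", "A"]

# Without a shared cache, each column starts counting from 0 on its own.
s_local, t_local = encode_local(s), encode_local(t)
print(s_local, t_local)  # [0, 1] [0, 1]

# With a shared cache, "A" and "B" keep the same code in both columns.
cache = SharedCache()
s_shared, t_shared = cache.encode(s), cache.encode(t)
print(s_shared, t_shared)  # [0, 1] [1, 0]
```

With private mappings the code 0 means "A" in one column and "B" in the other, so the integer codes cannot be compared directly; with the shared mapping they can.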
@s-banach Thanks for the insights. My understanding is that a separate StringCache is created whenever you create/convert a categorical dtype. Thus, even though the
data1 = {
'col': ['a', 'b', 'c'],
'col_1': [1, 2, 3]
}
data2 = {
'col': ['a', 'b', 'c'],
'col_2': [1, 2, 3]
}
df_1 = pl.DataFrame(data1, schema={'col': pl.Categorical, 'col_1': pl.Int64})
df_2 = pl.DataFrame(data2, schema={'col': pl.Categorical, 'col_2': pl.Int64})
df_1.join(df_2, on='col', how='outer')
If this is indeed the case, I wonder if it would be a good idea to just set up a StringCache in the first place, like a CategoricalDtype in pandas, so
cat = pl.StringCache(['a', 'b', 'c'])
df_1 = pl.DataFrame(data1, schema={'col': cat, 'col_1': pl.Int64})
df_2 = pl.DataFrame(data2, schema={'col': cat, 'col_2': pl.Int64})
# in pandas:
cat = pd.CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)
df_1.col = df_1.col.astype(cat) |
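To make the proposal concrete, here is a minimal plain-Python sketch of a category set that is declared up front and shared by several columns. `FixedCategories` is a hypothetical class for illustration, not a Polars or pandas API, and raising on unknown values is an assumption of this sketch rather than the behavior of either library.

```python
class FixedCategories:
    """Hypothetical shared dtype: the categories and their codes are fixed up front."""

    def __init__(self, categories):
        self.categories = list(categories)
        self._codes = {c: i for i, c in enumerate(self.categories)}

    def encode(self, values):
        unknown = [v for v in values if v not in self._codes]
        if unknown:
            # Assumption of this sketch: undeclared values are an error.
            raise ValueError(f"not a declared category: {unknown[0]!r}")
        return [self._codes[v] for v in values]

    def decode(self, codes):
        return [self.categories[c] for c in codes]


cat = FixedCategories(["a", "b", "c"])

# Two independent columns encoded against the same declared categories
left = cat.encode(["a", "b", "c"])
right = cat.encode(["c", "a", "b"])
print(left, right)  # [0, 1, 2] [2, 0, 1]
```

Because both columns are encoded against the same declaration, their codes agree regardless of the order in which the values appear, and no global state is needed.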
There are some nuances when it comes to joining on categoricals. I am hoping to explain those in more detail in the user guide in #11822. Feedback on this one would be greatly appreciated. |
Super cool work! Will pandas categorical dtypes be converted to Enum now? |
The prose of the new categorical user guide might need to be slightly polished, but it’s hard to type suggestions now on my phone. |
Problem description
The current implementation (0.16.14) of .join requires the key columns to be under the same global string cache when they are categorical. In pandas, by contrast, they are kept Categorical if the categories are the same and converted to string if they differ.
It seems a bit tedious and disruptive to me that I have to write a context manager, with pl.StringCache(), whenever I have to join two dataframes together. I wonder if this could be done on the fly:
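One way to picture "on the fly" is to translate the right-hand column's codes into the left-hand column's code space just before joining. The following is a plain-Python sketch of that idea under stated assumptions, not a description of what Polars does: `remap_codes` is a hypothetical helper, each column is modeled as a list of integer codes plus its own list of categories, and right-hand keys are assumed unique.

```python
def remap_codes(right_codes, right_categories, left_categories):
    """Translate right-hand codes into the left-hand code space.

    Categories missing on the left are appended, so existing left codes stay valid.
    """
    merged = list(left_categories)
    index = {c: i for i, c in enumerate(merged)}
    translation = []
    for c in right_categories:
        if c not in index:
            index[c] = len(merged)
            merged.append(c)
        translation.append(index[c])
    return [translation[code] for code in right_codes], merged


# Each column carries its own private categories, as without a global cache.
left_categories, left_codes = ["a", "b", "c"], [0, 1, 2]     # a, b, c
right_categories, right_codes = ["c", "a", "d"], [0, 1, 2]   # c, a, d

remapped, merged = remap_codes(right_codes, right_categories, left_categories)
print(remapped, merged)  # [2, 0, 3] ['a', 'b', 'c', 'd']

# Inner join on the now-comparable integer codes: (key, left row, right row)
right_rows = {code: row for row, code in enumerate(remapped)}
matches = [
    (merged[code], row, right_rows[code])
    for row, code in enumerate(left_codes)
    if code in right_rows
]
print(matches)  # [('a', 0, 1), ('c', 2, 0)]
```

The remapping costs one pass over the right-hand categories plus one pass over its codes, which is the kind of synchronization work the string cache otherwise does at construction time.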