[FEATURE] PolynomialColumnTransformer with maximum number of columns, `n_dimensions` #715

CamiloMartinezM · 2024-11-09T13:08:53Z

Description

Add the ability to specify the maximum number of dimensions to include in the polynomial features combinations generated by sklearn.preprocessing.PolynomialFeatures, allowing users to apply polynomial transformations to only the first N columns while keeping remaining columns unchanged.

Use Case

When working with certain types of data, we might apply a feature importance technique that yields a ranking of the features, for instance by p-value after a statistical test. Often only the first few features need polynomial expansion while the later ones should remain untouched, or we simply want to reduce computational overhead.
While doing feature engineering, we might already have domain knowledge indicating that only specific columns benefit from polynomial interactions.
If we one-hot encode categorical columns, we end up with columns that only hold 0/1 values. Polynomial expansion on these doesn't make sense and unnecessarily increases the number of columns in the dataset.
In general, this gives the user more control, reducing computational complexity and memory usage by limiting polynomial generation to the relevant features.

I ended up writing this transformer myself so I could use it in a scikit-learn pipeline that preprocesses ~100 features of a healthcare dataset. The number of output columns blew up quickly when applying PolynomialFeatures with degree <= 3.
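To put a number on that blow-up, here is a rough sketch that counts the output columns with the standard combinatorial formulas. The helper name `n_poly_features` is made up for illustration, and it assumes an integer `degree` (not a `(min_degree, max_degree)` tuple).

```python
from math import comb


def n_poly_features(n_columns, degree, interaction_only=False, include_bias=True):
    """Count the columns PolynomialFeatures would emit (hypothetical helper)."""
    if interaction_only:
        # Products of k distinct columns, for k = 0..degree (k = 0 is the bias).
        total = sum(comb(n_columns, k) for k in range(degree + 1))
    else:
        # All monomials of total degree <= degree in n_columns variables.
        total = comb(n_columns + degree, degree)
    return total if include_bias else total - 1


# Expanding all 100 columns at degree 3.
print(n_poly_features(100, 3))  # 176851

# Expanding only the first 10 columns and passing the other 90 through.
print(n_poly_features(10, 3) + 90)  # 376
```

The same formulas give 8 and 10 columns for the three-column examples under Current Behavior.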

Current Behavior

Currently, PolynomialFeatures transforms all input columns. To limit the combinations, we can only specify degree=(min_degree, max_degree) and interaction_only=True or False. For example:

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.preprocessing import PolynomialFeatures

>>> X = pd.DataFrame(np.random.randn(100, 3), columns=list("abc"))
>>> PolynomialFeatures(degree=3, interaction_only=True).fit(X).get_feature_names_out(X.columns)
array(['1', 'a', 'b', 'c', 'a b', 'a c', 'b c', 'a b c'], dtype=object)

>>> PolynomialFeatures(degree=2, interaction_only=False).fit(X).get_feature_names_out(X.columns)
array(['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2'], dtype=object)

>>> PolynomialFeatures(degree=(2, 2), interaction_only=False).fit(X).get_feature_names_out(X.columns)
array(['1', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2'], dtype=object)

>>> PolynomialFeatures(degree=(2, 3), interaction_only=True).fit(X).get_feature_names_out(X.columns)
array(['1', 'a b', 'a c', 'b c', 'a b c'], dtype=object)

Someone already pitched a similar idea to scikit-learn here, which resulted in this PR. That change made it possible to pass a tuple, degree=(min_degree, max_degree), whereas previously degree had to be a single int.

Hacky Solution?

I haven't tried this myself, but I guess you could use sklego.preprocessing.ColumnSelector, which selects columns by name (I'm taking this from the README), apply PolynomialFeatures to the selection, and then concatenate the result with the remaining columns. I find this unnecessarily complex, since it takes three transformers: a column selector, polynomial features, and a union to concatenate the features back. It also assumes the user knows the column names at every point in the pipeline, which doesn't always hold, since some scikit-learn transformers, including PolynomialFeatures, change the column names.

As far as I know, there is no intuitive way to achieve this with a single transformer.

Proposed Solution

Add n_dimensions parameter to control how many columns from the input should undergo polynomial transformation:

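from sklearn.base import BaseEstimator, TransformerMixin

class PolynomialColumnTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_dimensions=None, degree=2, interaction_only=False, include_bias=True):
        # n_dimensions: number of columns to transform (left to right); None transforms all columns
        self.n_dimensions = n_dimensions
        self.degree = degree
        self.interaction_only = interaction_only
        self.include_bias = include_bias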

You don't specify which columns to use, only that the first n_dimensions columns are expanded. However, you could sort the columns yourself beforehand, based on feature importances or domain knowledge, so that only the ones you want go through polynomial expansion. This transformer could also be combined with other transformers that move columns to the front or sort them according to some feature engineering algorithm. Nevertheless, I'm open to suggestions to make this more general. I already have the code for this implemented, along with thorough tests, so I could take this as a first issue if you find it useful.
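For readers following along, the behaviour described above could look roughly like the sketch below. This is not the implementation mentioned in the issue, only a minimal guess at it: the fitted attributes `n_expanded_` and `poly_` are made-up names, input is converted to a NumPy array (so column names are dropped), and edge cases such as `n_dimensions=0` are not handled.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PolynomialFeatures


class PolynomialColumnTransformer(BaseEstimator, TransformerMixin):
    """Sketch: expand only the first n_dimensions columns, keep the rest as-is."""

    def __init__(self, n_dimensions=None, degree=2, interaction_only=False, include_bias=True):
        self.n_dimensions = n_dimensions
        self.degree = degree
        self.interaction_only = interaction_only
        self.include_bias = include_bias

    def fit(self, X, y=None):
        X = np.asarray(X)
        # None means "expand every column", like plain PolynomialFeatures.
        self.n_expanded_ = X.shape[1] if self.n_dimensions is None else self.n_dimensions
        self.poly_ = PolynomialFeatures(
            degree=self.degree,
            interaction_only=self.interaction_only,
            include_bias=self.include_bias,
        ).fit(X[:, : self.n_expanded_])
        return self

    def transform(self, X):
        X = np.asarray(X)
        # Expanded leading columns first, untouched trailing columns after.
        expanded = self.poly_.transform(X[:, : self.n_expanded_])
        return np.hstack([expanded, X[:, self.n_expanded_ :]])


X = np.arange(12, dtype=float).reshape(4, 3)
out = PolynomialColumnTransformer(n_dimensions=2).fit_transform(X)
print(out.shape)  # (4, 7): six polynomial columns plus one untouched column
```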


FBruzzesi · 2024-11-10T20:26:42Z

Hey @CamiloMartinezM thanks for the very detailed explanation.

To me, the proposed solution seems to be already possible via sklearn.compose.ColumnTransformer, in which you can specify either the list of column names or the column positions to transform, and decide what to do with the remaining columns:

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer

X = pd.DataFrame(np.random.randn(100, 3), columns=list("abc"))

cols = ["a", "b"] # or list(range(n_dimensions))

n_dim_polynomial = ColumnTransformer(
    transformers = [
        ("poly", PolynomialFeatures(degree=2), cols)
    ],
    remainder="passthrough"  # no need for a union with the other columns, they are passed through as they are
).set_output(transform="pandas")

n_dim_polynomial.fit_transform(X).head(2)

Resulting in:

	poly__1	poly__a	poly__b	poly__a^2	poly__a b	poly__b^2	remainder__c
0	1	0.299388	-0.633881	0.089633	-0.189776	0.401805	-0.137093
1	1	-0.0791178	1.48079	0.00625962	-0.117157	2.19274	-1.21945

koaning · 2024-11-10T21:16:45Z

Another alternative is to use the ColumnSelector in this project.

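from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklego.preprocessing import ColumnSelector

pipe = make_pipeline(
    ColumnSelector(['a', 'b']),  # keep only columns a and b
    PolynomialFeatures()  # expand the selected columns (default degree=2)
)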

CamiloMartinezM added the enhancement New feature or request label Nov 9, 2024
