[FEATURE] PolynomialColumnTransformer with maximum number of columns, n_dimensions #715

Open
CamiloMartinezM opened this issue Nov 9, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@CamiloMartinezM

Description

Add the ability to limit how many input columns take part in the polynomial feature combinations generated by sklearn.preprocessing.PolynomialFeatures, so that users can apply the polynomial transformation to only the first N columns while leaving the remaining columns unchanged.

Use Case

  • When working with certain types of data, we might apply a feature importance technique that yields a ranking of the features, for instance by p-value after a statistical test. Often only the first few features need polynomial expansion while the later ones should remain untouched, or we simply want to reduce computational overhead.
  • While doing feature engineering, we might already have domain knowledge indicating that only specific columns benefit from polynomial interactions.
  • If we one-hot encode categorical columns, we end up with columns that contain only 0 and 1. Polynomial expansion of these columns doesn't make sense and unnecessarily increases the number of columns in the dataset.
  • More generally, it gives the user more control and reduces computational cost and memory usage by limiting polynomial generation to the relevant features.

I ended up writing this transformer myself so that I could use it in a scikit-learn pipeline that preprocesses ~100 features of a healthcare dataset. The number of output columns quickly blew up when applying PolynomialFeatures with degree <= 3.
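To put a number on that blow-up, here is a small sketch that counts the output columns without fitting anything. The helper names `n_poly_features` and `n_interaction_features` are made up for this illustration; the counts follow from standard combinatorics (monomials of total degree at most d in n variables) and agree with the feature name lists shown under Current Behavior.

```python
from math import comb


def n_poly_features(n_features, degree, include_bias=True):
    """Number of monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)  # this count includes the constant term
    return total if include_bias else total - 1


def n_interaction_features(n_features, degree, include_bias=True):
    """Same count when only products of distinct features are allowed."""
    total = sum(comb(n_features, k) for k in range(1, min(degree, n_features) + 1))
    return total + 1 if include_bias else total


# Expanding all 100 columns at degree 3.
all_columns = n_poly_features(100, 3)

# Expanding only the first 10 columns and passing the other 90 through.
first_ten_only = n_poly_features(10, 3) + 90

print(all_columns, first_ten_only)
```

With 100 input columns and degree 3 the full expansion has 176851 columns, while expanding only the first 10 and passing the rest through gives 376.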

Current Behavior

Currently, PolynomialFeatures transforms all input columns. To limit the combinations, we can only set degree=(min_degree, max_degree) and interaction_only=True or False. For example,

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.preprocessing import PolynomialFeatures

>>> X = pd.DataFrame(np.random.randn(100, 3), columns=list("abc"))
>>> PolynomialFeatures(degree=3, interaction_only=True).fit(X).get_feature_names_out(X.columns)
array(['1', 'a', 'b', 'c', 'a b', 'a c', 'b c', 'a b c'], dtype=object)

>>> PolynomialFeatures(degree=2, interaction_only=False).fit(X).get_feature_names_out(X.columns)
array(['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2'], dtype=object)

>>> PolynomialFeatures(degree=(2, 2), interaction_only=False).fit(X).get_feature_names_out(X.columns)
array(['1', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2'], dtype=object)

>>> PolynomialFeatures(degree=(2, 3), interaction_only=True).fit(X).get_feature_names_out(X.columns)
array(['1', 'a b', 'a c', 'b c', 'a b c'], dtype=object)

Someone already pitched a similar idea to scikit-learn here, which resulted in this PR. That change made it possible to pass a tuple, degree=(min_degree, max_degree), whereas previously degree had to be a single int.

Hacky Solution?

I haven't tried this myself, but you could presumably use sklego.preprocessing.ColumnSelector, which selects columns by name (I'm taking this from the README), apply PolynomialFeatures to the selection, and then concatenate the result with the remaining columns. I find this unnecessarily complex, since it takes three transformers: a column selector, polynomial features, and a union to concatenate the features back. It also assumes the user knows the column names at every point in the pipeline, which doesn't always hold, since some scikit-learn transformers, including PolynomialFeatures, change the column names.

As far as I know, there is no intuitive way to achieve this with a single transformer.

Proposed Solution

Add an n_dimensions parameter to control how many columns of the input should undergo polynomial transformation:

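from sklearn.base import BaseEstimator, TransformerMixin

class PolynomialColumnTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_dimensions=None, degree=2, interaction_only=False, include_bias=True):
        # n_dimensions: number of columns to transform (from left to right);
        # if None, all columns are transformed (default behavior)
        self.n_dimensions, self.degree = n_dimensions, degree
        self.interaction_only, self.include_bias = interaction_only, include_bias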

You don't specify which columns to use, only that the first n_dimensions columns are used. However, you could first sort the columns yourself based on feature importances or domain knowledge, so that only the ones you want undergo polynomial expansion. This transformer could also be combined with other transformers that move columns to the front or sort them based on some feature engineering algorithm. Nevertheless, I'm open to suggestions to make this more general. I already have the code implemented, along with very thorough tests to make sure nothing breaks, so I could take this on as a first issue if you find it useful.
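For readers who want to see what such a transformer might look like, here is a minimal sketch. It is not the author's implementation, which isn't shown in this thread. The fitted attribute names `n_expanded_` and `poly_`, the validation rule, and the choice to append untouched columns on the right are all assumptions made for illustration. The mixin is listed before BaseEstimator, following scikit-learn's convention for custom estimators.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PolynomialFeatures


class PolynomialColumnTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: expand only the first `n_dimensions` columns."""

    def __init__(self, n_dimensions=None, degree=2, interaction_only=False, include_bias=True):
        self.n_dimensions = n_dimensions
        self.degree = degree
        self.interaction_only = interaction_only
        self.include_bias = include_bias

    def fit(self, X, y=None):
        X = np.asarray(X)
        n_columns = X.shape[1]
        # None means "expand every column", i.e. plain PolynomialFeatures.
        self.n_expanded_ = n_columns if self.n_dimensions is None else self.n_dimensions
        if not 1 <= self.n_expanded_ <= n_columns:
            raise ValueError(f"n_dimensions must be between 1 and {n_columns}")
        self.poly_ = PolynomialFeatures(
            degree=self.degree,
            interaction_only=self.interaction_only,
            include_bias=self.include_bias,
        ).fit(X[:, : self.n_expanded_])
        return self

    def transform(self, X):
        X = np.asarray(X)
        expanded = self.poly_.transform(X[:, : self.n_expanded_])
        # Untouched columns are appended to the right of the expanded block.
        return np.hstack([expanded, X[:, self.n_expanded_ :]])


X = np.arange(12, dtype=float).reshape(3, 4)
out = PolynomialColumnTransformer(n_dimensions=2).fit_transform(X)
print(out.shape)
```

This sketch returns a plain NumPy array and does not implement get_feature_names_out, so column names are lost. A complete version would need to handle that, along with sparse input.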

CamiloMartinezM added the enhancement label Nov 9, 2024
@FBruzzesi
Collaborator

FBruzzesi commented Nov 10, 2024

Hey @CamiloMartinezM, thanks for the very detailed explanation.

To me, the proposed solution seems to be already possible via sklearn.compose.ColumnTransformer, in which you can specify either the list of column names or the column positions to transform, and decide what to do with the remaining columns:

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer

X = pd.DataFrame(np.random.randn(100, 3), columns=list("abc"))

cols = ["a", "b"] # or list(range(n_dimensions))

n_dim_polynomial = ColumnTransformer(
    transformers = [
        ("poly", PolynomialFeatures(degree=2), cols)
    ],
    remainder="passthrough"  # no need for a union with the other columns, they are passed through as they are
).set_output(transform="pandas")

n_dim_polynomial.fit_transform(X).head(2)

Resulting in:

   poly__1     poly__a    poly__b   poly__a^2  poly__a b  poly__b^2  remainder__c
0        1    0.299388  -0.633881    0.089633  -0.189776   0.401805     -0.137093
1        1  -0.0791178    1.48079  0.00625962  -0.117157    2.19274      -1.21945
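The positional variant mentioned in the comment above (`list(range(n_dimensions))`) can be sketched as follows. It does not depend on column names, which may help when earlier pipeline steps rename columns or return plain arrays. The value `n_dimensions = 2` and the random input are illustrative only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))  # plain array: the columns have no names

n_dimensions = 2  # illustrative: expand only the first two columns
n_dim_polynomial = ColumnTransformer(
    transformers=[("poly", PolynomialFeatures(degree=2), list(range(n_dimensions)))],
    remainder="passthrough",  # columns 2 and 3 are appended unchanged
)

Xt = n_dim_polynomial.fit_transform(X)
print(Xt.shape)
```

The two selected columns expand into six polynomial columns (including the bias), and the two remaining columns are appended after them.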

@koaning
Owner

koaning commented Nov 10, 2024

Another alternative is to use the ColumnSelector in this project.

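from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklego.preprocessing import ColumnSelector

pipe = make_pipeline(
    ColumnSelector(['a', 'b']),
    PolynomialFeatures()
)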
