[FEATURE] PolynomialColumnTransformer with maximum number of columns, n_dimensions
#715
Labels
enhancement
New feature or request
Description
Add the ability to specify the maximum number of dimensions to include in the polynomial feature combinations generated by `sklearn.preprocessing.PolynomialFeatures`, allowing users to apply polynomial transformations to only the first `N` columns while keeping the remaining columns unchanged.
Use Case
[…] `p_value`, for instance, after a statistical test. Often only the first few features need polynomial expansion, while later features should remain untouched. Limiting the expansion also reduces computational overhead.
[…] `[0, 1]` values, doing polynomial expansion on these doesn't make sense and unnecessarily increases the number of columns in the dataset.
I found myself coding this `Transformer` myself so that I could use it in a `scikit-learn` pipeline that preprocesses ~100 features in a healthcare dataset, where the number of output columns quickly blew up when applying `PolynomialFeatures` with `degree <= 3`.
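To put a number on that blow-up, here is a small back-of-the-envelope calculation. It relies on the standard count of monomials of total degree at most `d` in `n` variables, `C(n + d, d)`, which includes the bias column that `PolynomialFeatures` adds by default. The figure of 10 expanded columns is a made-up value for illustration, not one taken from the dataset mentioned above.

```python
from math import comb


def n_poly_outputs(n_features: int, degree: int) -> int:
    """Count monomials of total degree <= `degree` in `n_features` variables.

    This includes the constant (bias) term.
    """
    return comb(n_features + degree, degree)


n_features, degree = 100, 3

# Expanding every column, as PolynomialFeatures does today.
full_expansion = n_poly_outputs(n_features, degree)

# Expanding only the first 10 columns (hypothetical value) and passing
# the remaining 90 columns through unchanged.
n_expanded = 10
limited = n_poly_outputs(n_expanded, degree) + (n_features - n_expanded)

print(full_expansion)  # 176851
print(limited)         # 376
```

Under these assumptions the output shrinks from roughly 177 thousand columns to a few hundred.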
Current Behavior
Currently, `PolynomialFeatures` transforms all input columns. We can specify `degree=(min_degree, max_degree)` and `interaction_only=True` or `False` to limit the combinations.
Someone already pitched a similar idea to scikit-learn here, which ended up in this PR. This change allowed specifying a `tuple` like `degree=(min_degree, max_degree)`, whereas previously one could only specify `degree=int`.
Hacky Solution?
I haven't tried this myself, but I guess you could potentially use `sklego.preprocessing.ColumnSelector`, which selects columns based on column name (I'm taking this from the `README` file), apply `PolynomialFeatures`, and then concatenate the result with the remaining columns. But I find this unnecessarily complex, having to use three transformers: a column selector, polynomial features, and then a union to concatenate the features back. It also assumes the user knows the column names at every point in the pipeline, which doesn't always hold, since some scikit-learn transformers, including `PolynomialFeatures`, change the column names.
To the best of my knowledge, there is no other intuitive way to achieve this with a single transformer.
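For reference, the select / expand / concatenate workaround might look roughly like the sketch below. It is untested against sklego: instead of `sklego.preprocessing.ColumnSelector`, which selects by column name, it uses scikit-learn's `FunctionTransformer` with positional slicing as a stand-in, and `n_poly` is a hypothetical variable introduced for this example.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

n_poly = 2  # hypothetical: number of leading columns to expand


def first_columns(X):
    """Keep only the first `n_poly` columns."""
    return X[:, :n_poly]


def remaining_columns(X):
    """Keep everything after the first `n_poly` columns."""
    return X[:, n_poly:]


# Three moving parts: a selector, the polynomial expansion, and a union
# that glues the expanded block back onto the untouched columns.
workaround = FeatureUnion([
    (
        "poly",
        make_pipeline(
            FunctionTransformer(first_columns),
            PolynomialFeatures(degree=2, include_bias=False),
        ),
    ),
    ("rest", FunctionTransformer(remaining_columns)),
])

X = np.arange(12, dtype=float).reshape(3, 4)
Xt = workaround.fit_transform(X)
print(Xt.shape)  # (3, 7): 5 polynomial terms plus 2 untouched columns
```

Even in this positional form, the user has to wire up several objects to express what is conceptually a single step.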
Proposed Solution
Add an `n_dimensions` parameter to control how many columns from the input should undergo polynomial transformation.
Even though you can't specify which columns to use, only the first `n_dimensions` columns, you could sort the columns yourself first based on feature importances or domain knowledge, so that only the ones you want are used for polynomial expansion. Alternatively, one could use this transformer alongside others that move columns to the beginning or sort them based on some feature engineering algorithm. Nevertheless, I'm open to suggestions to make this even more general. I already have the code for this implemented, as well as very thorough tests to make sure nothing breaks, so I could take this as a first issue if you find it useful.
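To make the proposal concrete, here is a minimal sketch of what such a transformer could look like. It is an illustration only, not the implementation mentioned above: the class name is borrowed from the issue title, and the constructor signature, the fitted attribute names, and the choice to place the expanded block before the untouched columns are all assumptions.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PolynomialFeatures


class PolynomialColumnTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: expand only the first `n_dimensions` columns."""

    def __init__(self, n_dimensions=None, degree=2,
                 interaction_only=False, include_bias=False):
        self.n_dimensions = n_dimensions
        self.degree = degree
        self.interaction_only = interaction_only
        self.include_bias = include_bias

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Assumption: None means "expand every column".
        n = X.shape[1] if self.n_dimensions is None else self.n_dimensions
        if not 0 < n <= X.shape[1]:
            raise ValueError("n_dimensions must be in [1, n_features]")
        self.n_dimensions_ = n
        self.poly_ = PolynomialFeatures(
            degree=self.degree,
            interaction_only=self.interaction_only,
            include_bias=self.include_bias,
        ).fit(X[:, :n])
        return self

    def transform(self, X):
        X = np.asarray(X)
        expanded = self.poly_.transform(X[:, :self.n_dimensions_])
        # Columns past n_dimensions_ are passed through unchanged.
        return np.hstack([expanded, X[:, self.n_dimensions_:]])


X = np.arange(12, dtype=float).reshape(3, 4)
Xt = PolynomialColumnTransformer(n_dimensions=2, degree=2).fit_transform(X)
print(Xt.shape)  # (3, 7): 5 polynomial terms plus 2 untouched columns
```

Delegating to `PolynomialFeatures` internally keeps the existing `degree` and `interaction_only` semantics intact, so `n_dimensions` only decides which columns are handed to it.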