Can not pass feature_groups over to fit method when using GridSearchCV + GRCCA #202

JohannesWiesner · 2024-12-11T15:27:48Z

Hi James, this might be related to #150. I would like to use GridSearchCV in combination with GRCCA but I cannot find a way to pass the feature groups over to the .fit() method of GRCCA.

Currently I am getting:

ValueError: 
All the 40 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 886, in _fit_and_score
    estimator.fit(X_train, **fit_params)
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/pipeline.py", line 468, in fit
    routed_params = self._check_method_params(method="fit", props=params)
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/pipeline.py", line 374, in _check_method_params
    fit_params_steps[step]["fit"][param] = pval
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/utils/_bunch.py", line 39, in __getitem__
    return super().__getitem__(key)
KeyError: 'grcca'

Here's some example code:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from cca_zoo.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from cca_zoo.preprocessing import MultiViewPreprocessing
from sklearn.preprocessing import StandardScaler
from cca_zoo.linear import GRCCA

###############################################################################
## Simulate Data: Not part of the question
###############################################################################

# Random state for reproducibility
rng = np.random.RandomState(42)

# Parameters
n_samples = 100
n_features_X = 100
n_features_Y = 10
latent_correlation = 0.6

# Generate a latent variable
latent_dim = 1
latent_variable = rng.randn(n_samples, latent_dim)

# Generate X with structured covariance
# Define groups
group_sizes = [50, 25, 25]
group_correlations = [0.8, 0.7, 0.6]
X = np.zeros((n_samples, n_features_X))
current_feature = 0

for group_size, group_corr in zip(group_sizes, group_correlations):
    
    # Generate a group latent variable
    group_latent = latent_variable + rng.randn(n_samples, 1) * (1 - group_corr)
    
    # Generate group features
    group_features = group_latent @ rng.randn(1, group_size) + rng.randn(n_samples, group_size) * (1 - group_corr)
    X[:, current_feature:current_feature + group_size] = group_features
    current_feature += group_size

# Generate Y based on the latent variable
Y = latent_variable @ rng.randn(1, n_features_Y) + rng.randn(n_samples, n_features_Y) * (1 - latent_correlation)

###############################################################################
## Bring data in nice format: Not part of the question
###############################################################################

subject_ids = [f"subject_{i+1}" for i in range(n_samples)]

# get df_brain
df_brain = pd.DataFrame(X)
df_brain.index = subject_ids
df_brain.index.name = 'subject_id'
X_columns = pd.MultiIndex.from_arrays(
    [
        [f"area_{i+1}" for i in range(100)],  # area_label_idx
        ["network_1"] * 50 + ["network_2"] * 25 + ["network_3"] * 25  # brain_network_idx
    ],
    names=["brain_area","brain_network"]
)
df_brain.columns = X_columns

# get df_behavior
df_behavior = pd.DataFrame(Y)
df_behavior.index = subject_ids
df_behavior.index.name = 'subject_id'
df_behavior.columns = [f"behavioral_variable_{idx+1}" for idx in range(len(df_behavior.columns))]

###############################################################################
## Prepare Analysis: Somehow part of the question?
###############################################################################

# get feature groups: features in df_brain belong to 3 groups, features in df_behavior don't
# have any groups so we set the same number for all features (all features belong to one group)
groups_brain = df_brain.columns.get_level_values('brain_network').astype('category').codes.astype('int64')
groups_behavior = np.zeros(len(df_behavior.columns), dtype=int)
feature_groups = [groups_brain,groups_behavior]

# define latent dimensions
latent_dimensions = 1

# define folds
cv = KFold(5)

# just get numpy arrays
X1 = df_brain.values
X2 = df_behavior.values

###############################################################################
## Actual Question: Run GridSearch with Pipeline that includes Standardization 
## and GRCCA
###############################################################################

# define an estimator
estimator = Pipeline([
    ('preprocessing', MultiViewPreprocessing((StandardScaler(),StandardScaler()))),
    ('grcca',GRCCA(latent_dimensions=latent_dimensions,random_state=rng))
    ])

# define grid
param_grid = {'grcca__c':[[10**x for x in range(-1,1)],[10**x for x in range(-1,1)]],
              'grcca__mu':[[10**x for x in range(-1,1)],[0]]}

# run gridsearch
grid = GridSearchCV(estimator,param_grid,cv=cv)
grid.fit([X1,X2],grcca__feature_groups=feature_groups)


JohannesWiesner · 2024-12-11T15:57:40Z

By the way, I am also not sure how to correctly provide feature_groups. Should this be a list of lists (one list for each group)? What if one of the views does not have any feature groups?

See:

https://cca-zoo.readthedocs.io/en/latest/modules/generated/cca_zoo.linear.GRCCA.html#cca_zoo.linear.GRCCA.fit
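To make the question concrete, here is a minimal sketch of the layout used in the example code above: one integer group code per feature, collected in one sequence per view, with a view that has no group structure assigned a single shared group. The helper `encode_groups` and the toy labels are hypothetical, and whether this layout is what GRCCA's .fit() expects is exactly the open question, so treat it as an assumption rather than documented behaviour.

```python
# Sketch: one integer group code per feature, one sequence per view.
# The layout mirrors the example in this issue; it is an assumption,
# not a confirmed description of what GRCCA.fit() expects.

def encode_groups(labels):
    """Map group labels to integer codes, ordered by sorted label."""
    categories = sorted(set(labels))
    lookup = {label: code for code, label in enumerate(categories)}
    return [lookup[label] for label in labels]


# Toy view 1: five features spread across three networks.
brain_labels = ["network_1", "network_1", "network_2", "network_3", "network_3"]
groups_brain = encode_groups(brain_labels)

# Toy view 2: three features with no group structure, so all share group 0.
n_behavior_features = 3
groups_behavior = [0] * n_behavior_features

# One entry per view, in the same order as the views passed to fit().
feature_groups = [groups_brain, groups_behavior]
```

The plain lists would probably need to be converted to NumPy integer arrays before being handed to a real estimator, as the pandas-based code in the example does.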

JohannesWiesner · 2025-01-20T15:04:08Z

Created a separate issue for JointData:

#204

JohannesWiesner · 2025-01-20T15:10:57Z

Found the solution: the .fit() method of GridSearchCV has to be called like this:

grid.fit([X1,X2],estimator__grcca__feature_groups=feature_groups)
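A plausible reading of why the extra `estimator__` prefix is needed, inferred only from the `KeyError: 'grcca'` in the traceback and not confirmed in this thread: the grid search seems to wrap the user's pipeline in an outer pipeline whose step is named `estimator`, so fit parameters must be addressed through that outer step first. The sketch below is a simplified, hypothetical model of `step__param` routing in plain Python (`ToyPipeline` and `ToyGRCCA` are illustrative stand-ins, not scikit-learn or cca_zoo code).

```python
# Simplified, hypothetical model of "<step>__<param>" fit-parameter routing.
# This is NOT scikit-learn's or cca_zoo's code; it only mimics the
# split-on-first-"__" lookup that the traceback points to.

class ToyGRCCA:
    """Stand-in for a model whose fit() accepts feature_groups."""

    def fit(self, X, feature_groups=None):
        self.feature_groups_ = feature_groups
        return self


class ToyPipeline:
    """Routes fit params named '<step>__<param>' to the named step."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, **fit_params):
        routed = {name: {} for name, _ in self.steps}
        for key, value in fit_params.items():
            step, param = key.split("__", 1)  # split on the first "__" only
            routed[step][param] = value       # KeyError if the step is unknown
        for name, est in self.steps:
            est.fit(X, **routed[name])
        return self


groups = [[0, 0, 1], [0, 0]]
inner = ToyPipeline([("grcca", ToyGRCCA())])
# Assumption: the grid search wraps the user's pipeline in a step named "estimator".
outer = ToyPipeline([("estimator", inner)])

# Addressing the inner step directly fails at the outer level.
try:
    outer.fit(None, grcca__feature_groups=groups)
    error = None
except KeyError as exc:
    error = exc.args[0]

# Prefixing with the outer step name routes the parameter all the way down.
outer.fit(None, estimator__grcca__feature_groups=groups)
received = inner.steps[0][1].feature_groups_
```

In this toy model the first call fails with the same missing key as in the traceback, while the prefixed call delivers `feature_groups` to the innermost `fit()`.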

JohannesWiesner · 2025-01-20T15:22:17Z

Opened another issue on how to define feature_groups:

#205

JohannesWiesner · 2025-01-20T15:22:51Z

Found the solution: the .fit() method of GridSearchCV has to be called like this:

grid.fit([X1,X2],estimator__grcca__feature_groups=feature_groups)

@jameschapman19 Can you confirm this is the right way to set up GridSearch + Pipeline + GRCCA?
