Can not pass feature_groups over to fit method when using GridSearchCV + GRCCA #202

Open
JohannesWiesner opened this issue Dec 11, 2024 · 5 comments

Comments

@JohannesWiesner
Contributor

JohannesWiesner commented Dec 11, 2024

Hi James, this might be related to #150. I would like to use GridSearchCV in combination with GRCCA but I cannot find a way to pass the feature groups over to the .fit() method of GRCCA.

Currently I am getting:

ValueError: 
All the 40 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 886, in _fit_and_score
    estimator.fit(X_train, **fit_params)
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/pipeline.py", line 468, in fit
    routed_params = self._check_method_params(method="fit", props=params)
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/pipeline.py", line 374, in _check_method_params
    fit_params_steps[step]["fit"][param] = pval
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/utils/_bunch.py", line 39, in __getitem__
    return super().__getitem__(key)
KeyError: 'grcca'

Here's some example code:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from cca_zoo.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from cca_zoo.preprocessing import MultiViewPreprocessing
from sklearn.preprocessing import StandardScaler
from cca_zoo.linear import GRCCA

###############################################################################
## Simulate Data: Not part of the question
###############################################################################

# Random state for reproducibility
rng = np.random.RandomState(42)

# Parameters
n_samples = 100
n_features_X = 100
n_features_Y = 10
latent_correlation = 0.6

# Generate a latent variable
latent_dim = 1
latent_variable = rng.randn(n_samples, latent_dim)

# Generate X with structured covariance
# Define groups
group_sizes = [50, 25, 25]
group_correlations = [0.8, 0.7, 0.6]
X = np.zeros((n_samples, n_features_X))
current_feature = 0

for group_size, group_corr in zip(group_sizes, group_correlations):
    
    # Generate a group latent variable
    group_latent = latent_variable + rng.randn(n_samples, 1) * (1 - group_corr)
    
    # Generate group features
    group_features = group_latent @ rng.randn(1, group_size) + rng.randn(n_samples, group_size) * (1 - group_corr)
    X[:, current_feature:current_feature + group_size] = group_features
    current_feature += group_size

# Generate Y based on the latent variable
Y = latent_variable @ rng.randn(1, n_features_Y) + rng.randn(n_samples, n_features_Y) * (1 - latent_correlation)

###############################################################################
## Bring data in nice format: Not part of the question
###############################################################################

subject_ids = [f"subject_{i+1}" for i in range(n_samples)]

# get df_brain
df_brain = pd.DataFrame(X)
df_brain.index = subject_ids
df_brain.index.name = 'subject_id'
X_columns = pd.MultiIndex.from_arrays(
    [
        [f"area_{i+1}" for i in range(100)],  # area_label_idx
        ["network_1"] * 50 + ["network_2"] * 25 + ["network_3"] * 25  # brain_network_idx
    ],
    names=["brain_area","brain_network"]
)
df_brain.columns = X_columns

# get df_behavior
df_behavior = pd.DataFrame(Y)
df_behavior.index = subject_ids
df_behavior.index.name = 'subject_id'
df_behavior.columns = [f"behavioral_variable_{idx+1}" for idx in range(len(df_behavior.columns))]

###############################################################################
## Prepare Analysis: Somehow part of the question?
###############################################################################

# get feature groups: features in df_brain belong to 3 groups, features in df_behavior don't
# have any groups so we set the same number for all features (all features belong to one group)
groups_brain = df_brain.columns.get_level_values('brain_network').astype('category').codes.astype('int64')
groups_behavior = np.array([0 for f in range(len(df_behavior.columns))])
feature_groups = [groups_brain,groups_behavior]

# define latent dimensions
latent_dimensions = 1

# define folds
cv = KFold(5)

# just get numpy arrays
X1 = df_brain.values
X2 = df_behavior.values

###############################################################################
## Actual Question: Run GridSearch with Pipeline that includes Standardization 
## and GRCCA
###############################################################################

# define an estimator
estimator = Pipeline([
    ('preprocessing', MultiViewPreprocessing((StandardScaler(),StandardScaler()))),
    ('grcca',GRCCA(latent_dimensions=latent_dimensions,random_state=rng))
    ])

# define grid
param_grid = {'grcca__c':[[10**x for x in range(-1,1)],[10**x for x in range(-1,1)]],
              'grcca__mu':[[10**x for x in range(-1,1)],[0]]}

# run gridsearch
grid = GridSearchCV(estimator,param_grid,cv=cv)
grid.fit([X1,X2],grcca__feature_groups=feature_groups)
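A note on the "40 fits" in the error message above: the number is consistent with each per-view list in param_grid being expanded into a Cartesian product across views, and then multiplied by the 5 folds. The sketch below only reproduces that arithmetic with the standard library. It assumes this is how the grid is expanded (the thread does not confirm it) and it does not call cca_zoo.

```python
from itertools import product

# Per-view candidate values, copied from the param_grid above.
c_per_view = [[10**x for x in range(-1, 1)], [10**x for x in range(-1, 1)]]
mu_per_view = [[10**x for x in range(-1, 1)], [0]]

# Assumption: one candidate is one choice per view, for each parameter.
c_candidates = list(product(*c_per_view))    # 2 * 2 = 4 pairs
mu_candidates = list(product(*mu_per_view))  # 2 * 1 = 2 pairs

n_folds = 5  # KFold(5)
n_candidates = len(c_candidates) * len(mu_candidates)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Under that assumption there are 8 parameter combinations and 40 fits, so every single fit failed on the keyword routing rather than on a particular parameter value.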
@JohannesWiesner
Contributor Author

By the way, I am also not sure how to correctly provide feature_groups. Should this be a list of lists (one list for each group)? What if one of the views does not have any feature groups?

See:

https://cca-zoo.readthedocs.io/en/latest/modules/generated/cca_zoo.linear.GRCCA.html#cca_zoo.linear.GRCCA.fit

@JohannesWiesner
Contributor Author

Created a separate issue for JointData:

#204

@JohannesWiesner
Contributor Author

Found the solution. .fit() of GridSearchCV has to be called like this:

grid.fit([X1,X2],estimator__grcca__feature_groups=feature_groups)
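Why the extra estimator__ prefix helps can be illustrated with plain scikit-learn. The sketch below assumes that the user's pipeline ends up nested inside an outer Pipeline under a step named "estimator"; that is an assumption suggested by the KeyError: 'grcca' in the traceback and by the working call above, not something confirmed in this thread. GroupAwareModel is a hypothetical stand-in for GRCCA that only records the keyword it receives.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class GroupAwareModel(BaseEstimator):
    """Hypothetical stand-in for GRCCA: fit() accepts an extra keyword."""

    def fit(self, X, y=None, feature_groups=None):
        # Store what arrived so the routing can be checked afterwards.
        self.feature_groups_ = feature_groups
        return self


X = np.random.RandomState(0).randn(20, 4)
groups = np.array([0, 0, 1, 1])

# The user-defined pipeline, with the same final step name as in the issue.
inner = Pipeline([("scale", StandardScaler()), ("grcca", GroupAwareModel())])

# Assumption: a wrapper nests that pipeline under a step called "estimator".
outer = Pipeline([("estimator", inner)])

# The keyword names every level of nesting, outermost step first.
outer.fit(X, estimator__grcca__feature_groups=groups)
routed = outer.named_steps["estimator"].named_steps["grcca"].feature_groups_

# Without the outer prefix, the outer pipeline looks for its own "grcca" step.
try:
    outer.fit(X, grcca__feature_groups=groups)
    short_prefix_error = None
except KeyError as err:
    short_prefix_error = err

print(routed, repr(short_prefix_error))
```

Pipeline.fit splits each keyword at the first double underscore and hands the remainder to the named step, so a nested pipeline needs one prefix per level. The shorter keyword fails in the outer pipeline because it has no step called "grcca".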

@JohannesWiesner
Contributor Author

Opened another issue on how to define feature_groups:

#205

@JohannesWiesner
Contributor Author

Found the solution. .fit() of GridSearchCV has to be called like this:

grid.fit([X1,X2],estimator__grcca__feature_groups=feature_groups)

@jameschapman19 Can you confirm this is the right way to set up GridSearch + Pipeline + GRCCA?
