
ENH: Adds selective hyperparameter optimization #58

Open
wants to merge 2 commits into base: main

Conversation

itellaetxe
Contributor

Summary

Users can now specify which hyperparameters of the selected model should be optimized. This way, for models with many hyperparameters, the search space is greatly reduced, whereas previously we would train $g^n$ models ($n$: number of model hyperparameters, $g$: number of grid points per hyperparameter), e.g. $10^4$ fits for 4 hyperparameters with 10 grid points each. Also fixes double-dipping problems in model fitting.

Key Changes

  • Added selection of which hyperparameters to optimize, both in the CLI and in its interactive version (see the sketch after this list).
  • The hyperopt model now trains on a train/test split instead of cross-validation, due to hyperopt-sklearn library limitations. The test set indices for each model (dependent on covariates and systems) are written to CSV files under the ageml/model_age directory.
  • The AgeML.fit_age method now has 2 training cases instead of 3, making it more concise and efficient.
  • Modified the help message of the -ht argument.
  • Adapted tests of interactive CLI for the new hyperparameter selection.
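
A minimal sketch of how the new argument could be consumed; the helper name and exact token handling here are assumptions for illustration, not the merged ageml code:

def parse_hyperparameter_args(tokens):
    """Parse e.g. ['10', 'C=1,3', 'kernel=linear,rbf'] into grid settings."""
    n_points = int(tokens[0])  # number of grid points per numeric range
    if n_points < 2:
        raise ValueError("Hyperparameter tuning requires at least 2 grid points.")
    params = {}
    for token in tokens[1:]:
        name, values = token.split("=")
        params[name] = values.split(",")
    return n_points, params

n_points, params = parse_hyperparameter_args(["10", "C=1,3", "kernel=linear,rbf"])
# n_points == 10, params == {"C": ["1", "3"], "kernel": ["linear", "rbf"]}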

Other changes

  • Minor typo corrections
  • The rng from the ageml.datasets module is used in the ui module to do a random index permutation, instead of using np.random, which is discouraged.
  • MAE $\rightarrow$ mae

Also fixes double-dipping problems in model fitting
@itellaetxe itellaetxe linked an issue Oct 16, 2024 that may be closed by this pull request
Contributor

@JGarciaCondado JGarciaCondado left a comment

For the most part LGTM. Minor changes are required to ensure edge cases are well handled and that users understand the usage of hyperparameter optimization.

err_msg = (
    "Hyperparameter tuning parameters must be in the format "
    "param1=value1_low,value1_high param2=value2_low, value2_high..."
)
Contributor

We should also check that exactly two values are given: a low and a high. What happens if someone gives C=1,2,3? That should throw an error, since the user should write C=1,3 and -ht 3 to get the 3 hyperparameter points 1, 2, 3. Also, -ht should be a minimum of 2.
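
A small validation sketch along these lines (assumed names, not the actual ageml code): numeric ranges must be exactly low,high and the grid must have at least 2 points.

def validate_numeric_range(name, values, n_points):
    if len(values) != 2:
        raise ValueError(f"{name} must be given as low,high (e.g. {name}=1,3).")
    low, high = (float(v) for v in values)
    if low >= high:
        raise ValueError(f"{name}: the low value must be smaller than the high value.")
    if n_points < 2:
        raise ValueError("-ht must be at least 2 to cover both ends of the range.")
    return low, high

print(validate_numeric_range("C", ["1", "3"], 3))   # (1.0, 3.0)
# validate_numeric_range("C", ["1", "2", "3"], 3)   # raises ValueError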

Contributor

What about hyperparameters like kernels, where you want to choose between different kernels? Those should not be affected by -ht.

Contributor Author

Addressed! Thanks for this.

"a model. The parameter ranges are predefined. An integer is required.\n"
"(e.g. -ht 100 / --hyperparameter_tuning 100)"
"a model, and parameter ranges to sample from. An integer is required, followed \n"
"by the parameters to optimize. (e.g. -ht 100 C=1,2,3 kernel=linear,rbf)"
Contributor

Should this not be only C=1,2, i.e. a low and a high value? Also, how do you deal with it if someone puts 3 kernel types? And 100 hyperparameter grid points is unrealistic; an example that large may lead users to put too many.

Contributor Author

True, changed the C=1,2,3 to C=1,3. Also changed 100 to 10.
Working on dealing with string hyperparams.
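
One possible way to treat string-valued hyperparameters (e.g. kernels) as categorical choices that ignore the -ht grid size; this is a sketch with assumed names and structure, not the code that was merged (range validation is assumed to have happened earlier):

import numpy as np

def build_param_grid(ranges, n_points):
    # ranges comes from the CLI, e.g. {"C": ["1", "3"], "kernel": ["linear", "rbf"]}
    grid = {}
    for name, values in ranges.items():
        try:
            low, high = (float(v) for v in values)
        except ValueError:
            # Non-numeric values: keep the listed options as-is, any number of them.
            grid[f"model__{name}"] = values
            continue
        grid[f"model__{name}"] = np.linspace(low, high, n_points)
    return grid

print(build_param_grid({"C": ["1", "3"], "kernel": ["linear", "rbf"]}, 3))
# {'model__C': array([1., 2., 3.]), 'model__kernel': ['linear', 'rbf']}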

for hyperparam_name in list(self.hyperparameter_params.keys()):
    low, high = self.hyperparameter_params[hyperparam_name]
    if hyperparam_types[hyperparam_name] == "log":
        param_grid[f"model__{hyperparam_name}"] = np.logspace(low, high, int(self.hyperparameter_tuning))
Contributor

Where are you telling the users which parameters are log, int, etc.? For example, if I choose C=-3,3, how do I know whether it is using a log scale, or whether I should write C=0.001,100 instead?

Contributor Author

I will add it to the documentation, and add a link to the documentation explaining this.
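
For illustration (my own example, not from the PR): with a log-typed parameter the two values are treated as exponents, so C=-3,3 spans 1e-3 to 1e3, whereas a linear parameter would take the raw bounds.

import numpy as np

n_points = 7
log_grid = np.logspace(-3, 3, n_points)           # C=-3,3 interpreted as exponents ("log")
linear_grid = np.linspace(0.001, 1000, n_points)  # what writing the raw bounds would mean linearly

print(log_grid)     # [1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]
print(linear_grid)  # evenly spaced, so almost all points sit far from the small end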

self.pipeline.set_params(**opt_pipeline.best_params_)
if self.model_type != "hyperopt":
    # Optimize hyperparameters if required
    if self.hyperparameter_tuning > 0:
Contributor

@JGarciaCondado JGarciaCondado Oct 16, 2024

Hyperparameter tuning should require at least 2 points? It will give an error if you have C=1,2 and only one point. Will it choose C=1 or C=2?

Contributor Author

@itellaetxe itellaetxe Oct 21, 2024

True. Missed this, changing it to 2.

print("Test: MAE %.2f ± %.2f, RMSE %.2f ± %.2f, R2 %.3f ± %.3f, p %.3f ± %.3f" % tuple(summary_test))

# Select the best pipeline based on the mean scores
best_hyperparam_index = np.argmin(mae_means_test)
Contributor

This can cause memory problems because you are keeping all the predicted ages and corrected ages until you have trained all the models. You should keep the scores of all the different parameter grids, but at each iteration of the grid keep the predicted and corrected ages only if the model is better than the previous best.

Contributor Author

True, solving it
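
A sketch of the memory-friendly pattern suggested above, with placeholder helpers standing in for the real training/scoring code: keep every grid point's score, but only the predictions of the best model seen so far.

import numpy as np

rng = np.random.default_rng(0)

def fit_and_score(params):
    # Placeholder for training the pipeline with `params` and scoring it on the test split.
    predictions = rng.normal(size=100)   # stand-in for predicted ages
    mae = float(rng.uniform(2, 5))       # stand-in for the test MAE
    return predictions, mae

param_grid_candidates = [{"model__C": c} for c in np.logspace(-3, 3, 5)]

mae_means_test = []
best_mae = np.inf
best_predictions = None

for params in param_grid_candidates:
    predictions, mae = fit_and_score(params)
    mae_means_test.append(mae)
    if mae < best_mae:                   # lower MAE is better
        best_mae = mae
        best_predictions = predictions   # overwrite: only one set of predictions is kept

best_hyperparam_index = int(np.argmin(mae_means_test))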


if self.args.model_type == "hyperopt":
    # Split the data in training and test sets
    # TODO: Substitute 0.2 factor by the user CLI argument
Contributor

Why is this left as a todo instead of actually implementing it?

Contributor Author

time

src/ageml/ui.py Outdated
@@ -1761,7 +1800,9 @@ def model_age_command(self):
print("Hyperparameter tuning. Number of points in grid search: (Default: 0)")
Contributor

Should the default, if we are actually running hyperparameter tuning, not be at least 2? Maybe something like 5? If it is not, the message should not say hyperparameter tuning.

Contributor Author

True, missed this

    test_indices = indices[-test_size:]

else:
    train_indices = list(range(X.shape[0]))
Contributor

Why do we need these train and test indices? I see that they are just the whole dataset.

Contributor Author

The reason behind it is to avoid writing if/else flow-control statements every time I need to split the dataset depending on the model selection ('hyperopt' or not). It is an efficiency thing.
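
A sketch of that pattern under assumed names and shapes: when the model is not 'hyperopt', the indices simply cover the whole dataset, so downstream code can always index with train_indices/test_indices and never branch on the model type.

import numpy as np

def split_indices(n_samples, model_type, test_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    if model_type == "hyperopt":
        indices = rng.permutation(n_samples)
        test_size = int(test_fraction * n_samples)
        train_indices = indices[:-test_size]
        test_indices = indices[-test_size:]
    else:
        train_indices = list(range(n_samples))
        test_indices = list(range(n_samples))
    return train_indices, test_indices

train_idx, test_idx = split_indices(10, "linear_reg")
# train_idx == test_idx == [0, ..., 9]: the whole dataset, no special-casing later.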

Only the results of the best model are stored, to avoid using memory for predicted ages that would be discarded anyway. Also adds a link to the documentation explaining the 'log' / 'int' / 'float' thing in hyperparameter range specification
Development

Successfully merging this pull request may close these issues.

Train, test and validation in modelling