
ENH: Adds selective hyperparameter optimization #58

Open
wants to merge 2 commits into base: main

Conversation

itellaetxe
Contributor

Summary

Users can now specify which hyperparameters of the selected model should be optimized. This way, for models with many hyperparameters, the search space is greatly reduced, whereas previously we would train $g^n$ models ($n$: number of model hyperparameters, $g$: number of grid points per hyperparameter), e.g. $10^4$ fits for 4 hyperparameters with 10 grid points each. Also fixes double-dipping problems in model fitting.

Key Changes

  • Added selection of which hyperparameters to optimize, both in the CLI and in its interactive version (see the sketch after this list).
  • The hyperopt model now trains on a train/test split instead of cross-validation, due to hyperopt-sklearn library limitations. The test set indices for each model (dependent on covariates and systems) are written to CSV files under the ageml/model_age directory.
  • The AgeML.fit_age method now has 2 training cases instead of 3, making it more concise and efficient.
  • Modified the help message of the -ht argument.
  • Adapted tests of interactive CLI for the new hyperparameter selection.
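
A minimal sketch of how the new argument could be consumed; the helper name and exact token handling here are assumptions for illustration, not the merged ageml code:

def parse_hyperparameter_args(tokens):
    """Parse e.g. ['10', 'C=1,3', 'kernel=linear,rbf'] into grid settings."""
    n_points = int(tokens[0])  # number of grid points per numeric range
    if n_points < 2:
        raise ValueError("Hyperparameter tuning requires at least 2 grid points.")
    params = {}
    for token in tokens[1:]:
        name, values = token.split("=")
        params[name] = values.split(",")
    return n_points, params

n_points, params = parse_hyperparameter_args(["10", "C=1,3", "kernel=linear,rbf"])
# n_points == 10, params == {"C": ["1", "3"], "kernel": ["linear", "rbf"]}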

Other changes

  • Minor typo corrections
  • The rng from the ageml.datasets module is used in the ui module to do a random index permutation, instead of using np.random, which is discouraged.
  • MAE $\rightarrow$ mae

Also fixes double-dipping problems in model fitting
@itellaetxe itellaetxe linked an issue Oct 16, 2024 that may be closed by this pull request
Contributor

@JGarciaCondado JGarciaCondado left a comment

For the most part LGTM. Minor changes are required to ensure edge cases are well handled and that users understand the usage of hyperparameter optimization.

err_msg = (
    "Hyperparameter tuning parameters must be in the format "
    "param1=value1_low,value1_high param2=value2_low, value2_high..."
)
Contributor

We should also check that exactly two values are given: a low and a high. What happens if someone gives C=1,2,3? That should throw an error, since the user should write C=1,3 and -ht 3 to get the 3 hyperparameter points 1, 2, 3. Also, -ht should be a minimum of 2.
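
A small validation sketch along these lines (assumed names, not the actual ageml code): numeric ranges must be exactly low,high and the grid must have at least 2 points.

def validate_numeric_range(name, values, n_points):
    if len(values) != 2:
        raise ValueError(f"{name} must be given as low,high (e.g. {name}=1,3).")
    low, high = (float(v) for v in values)
    if low >= high:
        raise ValueError(f"{name}: the low value must be smaller than the high value.")
    if n_points < 2:
        raise ValueError("-ht must be at least 2 to cover both ends of the range.")
    return low, high

print(validate_numeric_range("C", ["1", "3"], 3))   # (1.0, 3.0)
# validate_numeric_range("C", ["1", "2", "3"], 3)   # raises ValueError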

Contributor

What about hyperparameters like kernels, where you want to choose between different kernels? Those should not be affected by -ht.

Contributor Author

Addressed! Thanks for this.

"a model. The parameter ranges are predefined. An integer is required.\n"
"(e.g. -ht 100 / --hyperparameter_tuning 100)"
"a model, and parameter ranges to sample from. An integer is required, followed \n"
"by the parameters to optimize. (e.g. -ht 100 C=1,2,3 kernel=linear,rbf)"
Contributor

Should this not be only C=1,2, i.e. a low and a high value? Also, how do you deal with it if someone puts 3 kernel types? And 100 hyperparameter grid points is unrealistic; an example that large may lead users to put too many.

Contributor Author

True, changed the C=1,2,3 to C=1,3. Also changed 100 to 10.
Working on dealing with string hyperparams.
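
One possible way to treat string-valued hyperparameters (e.g. kernels) as categorical choices that ignore the -ht grid size; this is a sketch with assumed names and structure, not the code that was merged (range validation is assumed to have happened earlier):

import numpy as np

def build_param_grid(ranges, n_points):
    # ranges comes from the CLI, e.g. {"C": ["1", "3"], "kernel": ["linear", "rbf"]}
    grid = {}
    for name, values in ranges.items():
        try:
            low, high = (float(v) for v in values)
        except ValueError:
            # Non-numeric values: keep the listed options as-is, any number of them.
            grid[f"model__{name}"] = values
            continue
        grid[f"model__{name}"] = np.linspace(low, high, n_points)
    return grid

print(build_param_grid({"C": ["1", "3"], "kernel": ["linear", "rbf"]}, 3))
# {'model__C': array([1., 2., 3.]), 'model__kernel': ['linear', 'rbf']}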

for hyperparam_name in list(self.hyperparameter_params.keys()):
    low, high = self.hyperparameter_params[hyperparam_name]
    if hyperparam_types[hyperparam_name] == "log":
        param_grid[f"model__{hyperparam_name}"] = np.logspace(low, high, int(self.hyperparameter_tuning))
Contributor

Where are you telling the users which parameters are log, int, etc.? For example, if I choose C=-3,3, how do I know whether it is using a log scale, or whether I should write C=0.001,100 instead?

Contributor Author

I will add it to the documentation, and add a link to the documentation explaining this.
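
For illustration (my own example, not from the PR): with a log-typed parameter the two values are treated as exponents, so C=-3,3 spans 1e-3 to 1e3, whereas a linear parameter would take the raw bounds.

import numpy as np

n_points = 7
log_grid = np.logspace(-3, 3, n_points)           # C=-3,3 interpreted as exponents ("log")
linear_grid = np.linspace(0.001, 1000, n_points)  # what writing the raw bounds would mean linearly

print(log_grid)     # [1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]
print(linear_grid)  # evenly spaced, so almost all points sit far from the small end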

self.pipeline.set_params(**opt_pipeline.best_params_)
if self.model_type != "hyperopt":
    # Optimize hyperparameters if required
    if self.hyperparameter_tuning > 0:
Contributor

@JGarciaCondado JGarciaCondado Oct 16, 2024

Hyperparameter tuning should require at least 2 points? It will give an error if you have C=1,2 and only one point. Will it choose C=1 or C=2?

Contributor Author

@itellaetxe itellaetxe Oct 21, 2024

True. Missed this, changing it to 2.

print("Test: MAE %.2f ± %.2f, RMSE %.2f ± %.2f, R2 %.3f ± %.3f, p %.3f ± %.3f" % tuple(summary_test))

# Select the best pipeline based on the mean scores
best_hyperparam_index = np.argmin(mae_means_test)
Contributor

This can cause memory problems because you are keeping all the predicted ages and corrected ages until you have trained all the models. You should keep the scores of all the different parameter grids, but at each iteration of the grid keep the predicted and corrected ages only if the model is better than the previous best.

Contributor Author

True, solving it
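
A sketch of the memory-friendly pattern suggested above, with placeholder helpers standing in for the real training/scoring code: keep every grid point's score, but only the predictions of the best model seen so far.

import numpy as np

rng = np.random.default_rng(0)

def fit_and_score(params):
    # Placeholder for training the pipeline with `params` and scoring it on the test split.
    predictions = rng.normal(size=100)   # stand-in for predicted ages
    mae = float(rng.uniform(2, 5))       # stand-in for the test MAE
    return predictions, mae

param_grid_candidates = [{"model__C": c} for c in np.logspace(-3, 3, 5)]

mae_means_test = []
best_mae = np.inf
best_predictions = None

for params in param_grid_candidates:
    predictions, mae = fit_and_score(params)
    mae_means_test.append(mae)
    if mae < best_mae:                   # lower MAE is better
        best_mae = mae
        best_predictions = predictions   # overwrite: only one set of predictions is kept

best_hyperparam_index = int(np.argmin(mae_means_test))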


if self.args.model_type == "hyperopt":
    # Split the data in training and test sets
    # TODO: Substitute 0.2 factor by the user CLI argument
Contributor

Why is this left as a todo instead of actually implementing it?

Contributor Author

time

src/ageml/ui.py Outdated
@@ -1761,7 +1800,9 @@ def model_age_command(self):
print("Hyperparameter tuning. Number of points in grid search: (Default: 0)")
Contributor

Should the default, if we are actually running hyperparameter tuning, not be at least 2? Maybe something like 5? If it is not, the message should not say hyperparameter tuning.

Contributor Author

True, missed this

    test_indices = indices[-test_size:]

else:
    train_indices = list(range(X.shape[0]))
Contributor

Why do we need these train and test indices? I see that they are just the whole dataset.

Contributor Author

The reason behind it is to avoid writing if/else flow-control statements every time I need to split the dataset depending on the model selection ('hyperopt' or not). It is an efficiency thing.
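
A sketch of that pattern under assumed names and shapes: when the model is not 'hyperopt', the indices simply cover the whole dataset, so downstream code can always index with train_indices/test_indices and never branch on the model type.

import numpy as np

def split_indices(n_samples, model_type, test_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    if model_type == "hyperopt":
        indices = rng.permutation(n_samples)
        test_size = int(test_fraction * n_samples)
        train_indices = indices[:-test_size]
        test_indices = indices[-test_size:]
    else:
        train_indices = list(range(n_samples))
        test_indices = list(range(n_samples))
    return train_indices, test_indices

train_idx, test_idx = split_indices(10, "linear_reg")
# train_idx == test_idx == [0, ..., 9]: the whole dataset, no special-casing later.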

Only the results of the best model are stored, to avoid using memory for predicted ages that would be discarded anyway. Also adds a link to the documentation explaining the 'log' / 'int' / 'float' thing in hyperparameter range specification
Development

Successfully merging this pull request may close these issues.

Train, test and validation in modelling