From 8da61a4ebd438c445cad321d259712d71b32df31 Mon Sep 17 00:00:00 2001
From: Caparrini
Date: Tue, 5 Nov 2024 20:11:49 +0100
Subject: [PATCH] Refactored and updated User Guide doc

---
 docs/conf.py                                  |   9 +
 docs/sections/Advanced/index.rst              |  23 +++
 .../{Concepts => Advanced}/parallel.rst       |   0
 .../reproducibility.rst                       |   0
 .../score_functions.rst                       |   0
 docs/sections/Basics/directory_structure.rst  |  47 -----
 docs/sections/Basics/index.rst                |  20 ---
 docs/sections/Basics/overview.rst             | 163 ------------------
 docs/sections/Concepts/hyperparam.rst         | 127 --------------
 docs/sections/Concepts/index.rst              |  43 -----
 docs/sections/Introduction/features.rst       |  23 +++
 docs/sections/Introduction/index.rst          |  17 ++
 docs/sections/Introduction/overview.rst       | 132 ++++++++++++++
 docs/sections/Quickstart/index.rst            |  25 +++
 docs/sections/Quickstart/step1.rst            | 120 +++++++++++++
 docs/sections/Quickstart/step2.rst            | 134 ++++++++++++++
 docs/sections/Quickstart/step3.rst            | 105 +++++++++++
 docs/sections/Quickstart/step4.rst            | 136 +++++++++++++++
 docs/sections/Results/directory_structure.rst |  49 ++++++
 docs/sections/Results/evolution_graph.rst     |  41 +++++
 docs/sections/Results/index.rst               |  21 +++
 docs/sections/Results/search_space_graph.rst  |  51 ++++++
 docs/sections/introduction.rst                | 103 -----------
 docs/sections/user_guide.rst                  |   7 +-
 24 files changed, 890 insertions(+), 506 deletions(-)
 create mode 100644 docs/sections/Advanced/index.rst
 rename docs/sections/{Concepts => Advanced}/parallel.rst (100%)
 rename docs/sections/{Concepts => Advanced}/reproducibility.rst (100%)
 rename docs/sections/{Concepts => Advanced}/score_functions.rst (100%)
 delete mode 100644 docs/sections/Basics/directory_structure.rst
 delete mode 100644 docs/sections/Basics/index.rst
 delete mode 100644 docs/sections/Basics/overview.rst
 delete mode 100644 docs/sections/Concepts/hyperparam.rst
 delete mode 100644 docs/sections/Concepts/index.rst
 create mode 100644 docs/sections/Introduction/features.rst
 create mode 100644 docs/sections/Introduction/index.rst
 create mode 100644 docs/sections/Introduction/overview.rst
 create mode 100644 docs/sections/Quickstart/index.rst
 create mode 100644 docs/sections/Quickstart/step1.rst
 create mode 100644 docs/sections/Quickstart/step2.rst
 create mode 100644 docs/sections/Quickstart/step3.rst
 create mode 100644 docs/sections/Quickstart/step4.rst
 create mode 100644 docs/sections/Results/directory_structure.rst
 create mode 100644 docs/sections/Results/evolution_graph.rst
 create mode 100644 docs/sections/Results/index.rst
 create mode 100644 docs/sections/Results/search_space_graph.rst
 delete mode 100644 docs/sections/introduction.rst

diff --git a/docs/conf.py b/docs/conf.py
index fc91407..48e9a28 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -51,6 +51,15 @@
 intersphinx_mapping = {
     'python': ('https://docs.python.org/3', None),
     'numpy': ('https://numpy.org/doc/stable/', None),
+    'scikit-learn': ('https://scikit-learn.org/stable/', None),
+    'mlflow': ('https://www.mlflow.org/docs/latest/', None),
+    'xgboost': ('https://xgboost.readthedocs.io/en/latest/', None),
+    'lightgbm': ('https://lightgbm.readthedocs.io/en/latest/', None),
+    'pandas': ('https://pandas.pydata.org/pandas-docs/stable/', None),
+    'matplotlib': ('https://matplotlib.org/stable/', None),
+    'seaborn': ('https://seaborn.pydata.org/', None),
+    'scipy': ('https://docs.scipy.org/doc/scipy/reference/', None),
+    'deap': ('https://deap.readthedocs.io/en/master/', None),
 }

 templates_path = ['_templates']
diff --git a/docs/sections/Advanced/index.rst
b/docs/sections/Advanced/index.rst new file mode 100644 index 0000000..19e8513 --- /dev/null +++ b/docs/sections/Advanced/index.rst @@ -0,0 +1,23 @@ +Advanced Customization +====================== + +The advanced customization options in `mloptimizer` enable fine-tuning of the optimization process, providing flexibility to adapt to different scenarios, computational resources, and evaluation needs. Use these options to define custom scoring metrics, ensure reproducibility, or leverage parallel processing for faster optimization. + +.. toctree:: + :hidden: + + score_functions + reproducibility + parallel + + +Overview of Customization Options +--------------------------------- + +- **Custom Score Functions**: Define custom scoring metrics tailored to your specific objectives. This flexibility allows you to optimize models based on metrics beyond standard evaluation scores, aligning with unique project requirements. + +- **Reproducibility**: Ensure consistent results by setting seeds and managing randomization across optimization runs. Reproducibility is essential for benchmarking and validating models in research and production environments. + +- **Parallel Processing**: Accelerate optimization by distributing computations across multiple cores. Parallel processing can significantly reduce runtime, especially for complex models or extensive hyperparameter spaces. + +Each section provides detailed guidance on implementing these advanced options. diff --git a/docs/sections/Concepts/parallel.rst b/docs/sections/Advanced/parallel.rst similarity index 100% rename from docs/sections/Concepts/parallel.rst rename to docs/sections/Advanced/parallel.rst diff --git a/docs/sections/Concepts/reproducibility.rst b/docs/sections/Advanced/reproducibility.rst similarity index 100% rename from docs/sections/Concepts/reproducibility.rst rename to docs/sections/Advanced/reproducibility.rst diff --git a/docs/sections/Concepts/score_functions.rst b/docs/sections/Advanced/score_functions.rst similarity index 100% rename from docs/sections/Concepts/score_functions.rst rename to docs/sections/Advanced/score_functions.rst diff --git a/docs/sections/Basics/directory_structure.rst b/docs/sections/Basics/directory_structure.rst deleted file mode 100644 index ed588c0..0000000 --- a/docs/sections/Basics/directory_structure.rst +++ /dev/null @@ -1,47 +0,0 @@ -============================= -Optimizer Directory Structure -============================= - -When an optimizer is run, it generates a directory in the current working directory (or the given directory as input). -This directory, named in the format `YYYYMMDD_nnnnnnnnnn_OptimizerName`, contains the results of the optimization process, -including the best estimator found, a log file detailing the optimization steps, and the final result of the optimization. - -Directory Structure -------------------- -The directory structure is as follows: - -.. code-block:: bash - - ├── checkpoints - │ ├── cp_gen_0.pkl - │ └── cp_gen_1.pkl - ├── graphics - │ ├── logbook.html - │ └── search_space.html - ├── opt.log - ├── progress - │ ├── Generation_0.csv - │ └── Generation_1.csv - └── results - ├── logbook.csv - └── populations.csv - -Directory Contents ------------------- -Each item in the directory serves a specific purpose: - -- `checkpoints`: Contains the checkpoint files for each generation of the genetic optimization process. These files preserve the state of the optimization process at each generation, enabling the process to be resumed from a specific point if necessary. 
- - `cp_gen_0.pkl`, `cp_gen_1.pkl`: These are the individual checkpoint files for each generation. They are named according to the generation number and are saved in Python's pickle format. - -- `graphics`: Contains HTML files for visualizing the optimization process. - - `logbook.html`: Provides a graphical representation of the logbook, which records the statistics of the optimization process over generations. - - `search_space.html`: Provides a graphical representation of the search space of the optimization process. - -- `opt.log`: The log file for the optimization process. It contains detailed logs of the optimization process, including the performance of the algorithm at each generation. - -- `progress`: Contains CSV files that record the progress of the optimization process for each generation. - - `Generation_0.csv`, `Generation_1.csv`: These are the individual progress files for each generation. They contain detailed information about each individual in the population at each generation. - -- `results`: Contains CSV files with the results of the optimization process. - - `logbook.csv`: This file is a CSV representation of the logbook, which records the statistics of the optimization process over generations. - - `populations.csv`: This file contains the final populations of the optimization process. It includes the hyperparameters and fitness values of each individual in the population. diff --git a/docs/sections/Basics/index.rst b/docs/sections/Basics/index.rst deleted file mode 100644 index 2a21ff7..0000000 --- a/docs/sections/Basics/index.rst +++ /dev/null @@ -1,20 +0,0 @@ -Basics -================== - -The BaseOptimizer is an abstract base class that provides -the fundamental structure for all optimizers in the mloptimizer package. -It is designed to optimize a classifier using a genetic algorithm. -The class includes methods for setting up the optimization process, -defining the hyperparameters to be optimized, and running the optimization. - -The BaseOptimizer class is designed to be subclassed. -mloptimizer provides several subclasses of the BaseOptimizer class. - - - -.. toctree:: - :hidden: - - overview - directory_structure - diff --git a/docs/sections/Basics/overview.rst b/docs/sections/Basics/overview.rst deleted file mode 100644 index 0e5c3e6..0000000 --- a/docs/sections/Basics/overview.rst +++ /dev/null @@ -1,163 +0,0 @@ -========================= -Overview -========================= - -Introduction ------------- -The main class objects are the `Optimizer` and the `HyperparameterSpace` classes. - -The optimizer `Optimizer` is able to optimize any model that complies with the `sklearn` API. -The `HyperparameterSpace` class is used to define the hyperparameters that will be optimized, either -the fixed hyperparameters or the hyperparameters that will be optimized. - -Usage ------ -To use the `Optimizer` class: - -1. Define your features and labels. -2. Choose a model to optimize that complies with the `sklearn` API. (e.g. `XGBClassifier`). -3. Create an instance of `HyperparameterSpace` with the hyperparameters that you want to optimize. -4. Call the `optimize_clf()` method to start the optimization process. - -.. note:: - There are default HyperparameterSpaces defined in the ``conf`` folder for the most common models. - You can use the HyperparameterSpace.get_default_hyperparams(class) (class e.g. XGBClassifier). - -There are several parameters than can be passed to the `Optimizer` constructor: - -- `estimator_class`: The class of the model to optimize. 
It should comply with the `sklearn` API. -- `X`: The features of your dataset. -- `y`: The labels of your dataset. -- `folder`: The folder where the files and folder will be saved. Defaults to the current directory. -- `log_file`: The name of the log file. Defaults to `mloptimizer.log`. -- `hyperparam_space`: The hyperparameter space to use for the optimization process. -- `eval_function`: The function to use to evaluate the model. Defaults to `train_score`. -- `score_function`: The function to use to score the model. Defaults to `accuracy_score`. -- `seed`: The seed to use for reproducibility. Defaults to a random integer between 0 and 1000000. - - -Default Usage Example ---------------------- - -The simplest example of using the Optimizer is: - -- Store your features and labels in `X` and `y` respectively. -- Use HyperparameterSpace.get_default_hyperparams(XGBClassifier) to get the default hyperparameters for the model you want to optimize. -- Create an instance of `Optimizer` with your classifier class, hyperparameter space, data and leave all other parameters to their default values. -- Call the `optimize_clf()` method to start the optimization process. You can pass the population size and the number of generations to the method. -- The result of the optimization process will be a object of type XGBClassifier with the best hyperparameters found. - -.. code-block:: python - - from mloptimizer.application import Optimizer - from mloptimizer.domain.hyperspace import HyperparameterSpace - from xgboost import XGBClassifier - from sklearn.datasets import load_iris - - # 1) Load the dataset and get the features and target - X, y = load_iris(return_X_y=True) - - # 2) Define the hyperparameter space (a default space is provided for some algorithms) - hyperparameter_space = HyperparameterSpace.get_default_hyperparameter_space(XGBClassifier) - - # 3) Create the optimizer and optimize the classifier - opt = Optimizer(estimator_class=XGBClassifier, features=X, labels=y, hyperparam_space=hyperparameter_space) - - clf = opt.optimize_clf(10, 10) - -This will create a folder (in the current location) with name `YYYYMMDD_nnnnnnnnnn_Optimizer` -(where `YYYYMMDD_nnnnnnnnnn` is the current timestamp) and a log file named `mloptimizer.log`. -To inspect the structure of the folder and what can you find in it, please refer to the `Folder Structure` section. - -Custom HyperparameterSpace Example ----------------------------------- - -Among the parameters that can be passed to the `Optimizer` constructor, -the `hyperaram_space` of class `HyperparameterSpace` is really important -and should be aligned with the machine learning algorithm passed to the Optimizer: `fixed_hyperparams` -and `evolvable_hyperparams`. - -The `evolvable_hyperparams` parameter is a dictionary of custom hyperparameters. -The key of each hyperparameter is the name of the hyperparameter, and the value is the `Hyperparam` object itself. -To understand how to use the `Hyperparam` object, please refer to the `Hyperparam` section inside Concepts. - -The `fixed_hyperparams` parameter is a dictionary of fixed hyperparameters. -This is simply a dictionary where the key is the name of the hyperparameter, and the value is the value of the hyperparameter. -These hyperparameters will not be optimized, but will be used as fixed values during the optimization process. - -An example of using custom hyperparameters is: - -.. 
code-block:: python - - from mloptimizer.domain.hyperspace import Hyperparam, HyperparameterSpace - # Define your custom hyperparameters - fixed_hyperparams = { - 'max_depth': 5 - } - evolvable_hyperparams = { - 'colsample_bytree': Hyperparam("colsample_bytree", 3, 10, 'float', 10), - 'gamma': Hyperparam("gamma", 0, 20, 'int'), - 'learning_rate': Hyperparam("learning_rate", 1, 100, 'float', 1000), - # 'max_depth': Hyperparam("max_depth", 3, 20, 'int'), - 'n_estimators': Hyperparam("n_estimators", 100, 500, 'int'), - 'subsample': Hyperparam("subsample", 700, 1000, 'float', 1000), - 'scale_pos_weight': Hyperparam("scale_pos_weight", 15, 40, 'float', 100) - } - - - custom_hyperparam_space = HyperparameterSpace(fixed_hyperparams, evolvable_hyperparams) - - # Create an instance of XGBClassifierOptimizer with custom hyperparameters - xgb_optimizer = Optimizer(estimator_class=XGBClassifier,features=X, labels=y, - hyperparam_space=custom_hyperparam_space) - - # Start the optimization process - result = xgb_optimizer.optimize_clf(3, 3) - - - - - -Both `evolvable_hyperparams` and `fixed_hyperparams` can be used together, -providing several different ways to customize the optimization process. - -Reproducibility ---------------- - -Researchers often need to be able to reproduce their results. During the research process it could be -advisable to run several optimizations processes with different parameters or input data. -However, if the results of the optimization process are not reproducible, it will be difficult to compare -the results of the different optimization processes. -In order to make the results reproducible, the `Optimizer` have a `seed` parameter. -This parameter is used to set the seed of the random number generator used during the optimization process. -If you set the same seed, the results of the optimization process will be the same. - -An example of two executions of the optimization process with the same seed that will produce the same result is: - -.. 
code-block:: python - - from mloptimizer.application import Optimizer - from mloptimizer.domain.hyperspace import HyperparameterSpace - from xgboost import XGBClassifier - from sklearn.datasets import load_iris - - # 1) Load the dataset and get the features and target - X, y = load_iris(return_X_y=True) - - # 2) Define the hyperparameter space (a default space is provided for some algorithms) - hyperparameter_space = HyperparameterSpace.get_default_hyperparameter_space(XGBClassifier) - - # 3) Create two instances of Optimizer with the same seed - xgb_optimizer1 = Optimizer(estimator_class=XGBClassifier, features=X, labels=y, - hyperparam_space = hyperparameter_space, seed=42) - result1 = xgb_optimizer1.optimize_clf(3, 3) - - xgb_optimizer2 = Optimizer(estimator_class=XGBClassifier, features=X, labels=y, - hyperparam_space = hyperparameter_space, seed=42) - result2 = xgb_optimizer2.optimize_clf(3, 3) - - # Verify that the results are the same - # The comparison is done using the string representation of the result objects - # which are the hyperparameters of the best model found - assert str(result1)== str(result2) - diff --git a/docs/sections/Concepts/hyperparam.rst b/docs/sections/Concepts/hyperparam.rst deleted file mode 100644 index da5aded..0000000 --- a/docs/sections/Concepts/hyperparam.rst +++ /dev/null @@ -1,127 +0,0 @@ -==================== -Hyperparam Class -==================== - -The Hyperparam class is a crucial component of our library, designed to optimize the hyperparameters of machine learning algorithms using the DEAP library, which provides genetic algorithms. - -Why We Need the Hyperparam Class --------------------------------- - -In the context of genetic optimization, a common problem is the repeated evaluation of slightly different individuals. This can lead to inefficiencies in the optimization process. To mitigate this, we use the Hyperparam class to limit the values of the hyperparameters, ensuring that the same individuals are not evaluated multiple times. - -How It Is Used --------------- - -The Hyperparam class is used to define a hyperparameter to optimize. It includes the name, minimum value, maximum value, and type of the hyperparameter. This class also controls the precision of the hyperparameter to avoid multiple evaluations with close values due to decimal positions. - -Depending on the type of hyperparameter, the Hyperparam class can apply a transformation to the value. For example, the 'nexp' type computes the negative power of the value, while the 'x10' type multiplies the value by 10. - -The Hyperparam class has several methods, including: - -- `__init__`: Initializes a new instance of the Hyperparam class. -- `correct`: Returns the real value of the hyperparameter in case some mutation could surpass the limits. -- `__eq__`: Overrides the default implementation to compare two Hyperparam instances. -- `__str__` and `__repr__`: Overrides the default implementations to provide a string representation of the Hyperparam instance. - -Types of Hyperparam -------------------- - -The `Hyperparam` class supports several types of hyperparameters. Here are examples of each type: - -- Integer hyperparameter: - -.. code-block:: python - - hyperparam_int = Hyperparam(name='max_depth', min_value=1, - max_value=10, hyperparam_type='int') - - -- Float hyperparameter: - -.. code-block:: python - - hyperparam_float = Hyperparam(name='learning_rate', min_value=0.01, max_value=1.0, - hyperparam_type='float', scale=100) - - -- 'nexp' hyperparameter: - -.. 
code-block:: python - - hyperparam_nexp = Hyperparam(name='nexp_param', min_value=1, - max_value=100, hyperparam_type='nexp') - -- 'x10' hyperparameter: - -.. code-block:: python - - hyperparam_x10 = Hyperparam(name='x10_param', min_value=1, - max_value=100, hyperparam_type='x10') - -- 'list' hyperparameter: - -.. code-block:: python - - hyperparam_list = Hyperparam(name='list_param', min_value=0, - max_value=3, hyperparam_type='list', values_str=["a", "b", "c"]) - - hyperparam_list_using_method = Hyperparam.from_list(name='list_param', values_str=["a", "b", "c"]) - -In these examples, we define hyperparameters of different types. The 'nexp' and 'x10' types are special types that apply a transformation to the hyperparameter value. -The 'list' type is used to define a hyperparameter that can take a value from a list of values (usually strings). - -Examples --------- - -Here's an example of how to use the Hyperparam class: - -.. code-block:: python - - # Define a hyperparameter - hyperparam = Hyperparam(name='learning_rate', min_value=0, max_value=1, - hyperparam_type='float', scale=100) - - # Correct a value - # This will return 1.0 as 150 is beyond the max_value - corrected_value = hyperparam.correct(150) - - -In this example, we define a hyperparameter named 'learning_rate' with a minimum value of 0, a maximum value of 1, and a type of float. The 'correct' method is then used to correct a value that is beyond the defined maximum value. - -Here's an example of how you can create a `HyperparameterSpace` instance and pass custom hyperparameters to it: - -.. code-block:: python - - from mloptimizer.domain.hyperspace import Hyperparam, HyperparameterSpace - - # Define custom hyperparameters - fixed_hyperparams = { - "criterion": "gini" - } - evolvable_hyperparams = { - "min_samples_split": Hyperparam("min_samples_split", 2, 50, 'int'), - "min_samples_leaf": Hyperparam("min_samples_leaf", 1, 20, 'int'), - "max_depth": Hyperparam("max_depth", 2, 20, 'int'), - "min_impurity_decrease": Hyperparam("min_impurity_decrease", 0, 150, 'float', 1000), - "ccp_alpha": Hyperparam("ccp_alpha", 0, 300, 'float', 100000) - } - - - # Create a HyperparameterSpace instance - hyperparam_space = HyperparameterSpace(fixed_hyperparams, evolvable_hyperparams) - - # Then we can use the hyperparam_space instance to optimize the hyperparameters - from sklearn.tree import DecisionTreeClassifier - from sklearn.datasets import load_iris - from mloptimizer.application import Optimizer - - # Load the iris dataset - X,y = load_iris(return_X_y=True) - - tree_optimizer = Optimizer(estimator_class=DecisionTreeClassifier, - hyperparam_space=hyperparam_space, - features=X, labels=y) - tree_optimizer.optimize_clf(3, 3) - - -In this example, we define custom hyperparameters and create a `HyperparameterSpace` instance. We then use the `HyperparameterSpace` instance to optimize the hyperparameters of a `DecisionTreeClassifier` using the `Optimizer` class. diff --git a/docs/sections/Concepts/index.rst b/docs/sections/Concepts/index.rst deleted file mode 100644 index 96d285b..0000000 --- a/docs/sections/Concepts/index.rst +++ /dev/null @@ -1,43 +0,0 @@ -Concepts -================== - -Concepts are the building blocks of the hyperparameter optimization -framework. They are used to define the search space and the score function. - -.. 
mermaid:: - - classDiagram - class Optimizer{ - +estimator_class estimator_class - +HyperparameterSpace hyperspace - +Tracker tracker - +Evaluator evaluator - +IndividualUtils individual_utils - optimize_clf() - } - class HyperparameterSpace{ - +dict fixed_hyperparams - +dict evolvable_hyperparams - from_json() - to_json() - } - class Evaluator{ - evaluate() - evaluate_individual() - } - class IndividualUtils{ - individual2dict() - get_clf() - } - Optimizer "1" --o "1" HyperparameterSpace - Optimizer "1" --o "1" Evaluator - Optimizer "1" --o "1" IndividualUtils - - -.. toctree:: - :hidden: - - hyperparam - score_functions - reproducibility - parallel diff --git a/docs/sections/Introduction/features.rst b/docs/sections/Introduction/features.rst new file mode 100644 index 0000000..4e0dcae --- /dev/null +++ b/docs/sections/Introduction/features.rst @@ -0,0 +1,23 @@ +Features +======== + +`mloptimizer` is designed to streamline hyperparameter optimization for machine learning models by leveraging genetic algorithms. With a flexible and extensible architecture, it integrates seamlessly with the :mod:`scikit-learn` API, making it a valuable tool for both researchers and practitioners looking to improve model performance efficiently. Below is an overview of key and advanced features that make `mloptimizer` a robust choice for hyperparameter tuning. + +Key Features +------------ + +- **User-Friendly**: Intuitive syntax, fully compatible with the :mod:`scikit-learn` API. +- **DEAP-Based Genetic Algorithms**: Built on the :mod:`deap` library, which supports flexible and robust genetic search algorithms. The use of :mod:`deap` provides a foundation for effective evolutionary computation techniques within `mloptimizer`. +- **Predefined and Custom Hyperparameter Spaces**: Includes default hyperparameter spaces for commonly used algorithms, along with options to define custom spaces to suit unique needs. +- **Customizable Score Functions**: Offers default metrics for model evaluation, with the flexibility to add custom scoring functions. +- **Reproducibility and Parallelization**: Ensures reproducible results and supports parallel processing to accelerate optimization tasks. + +Advanced Features +----------------- + +- **Extensibility**: Easily extendable to additional machine learning models that comply with the :class:`Estimator ` class from the :mod:`scikit-learn` API. +- **Custom Hyperparameter Ranges**: Allows users to define specific hyperparameter ranges as needed. +- **MLflow Integration** (Optional): Enables tracking of optimization runs through :mod:`mlflow` for more detailed analysis. +- **Optimization Monitoring**: Provides detailed logs and visualizations to monitor the optimization process. +- **Checkpointing and Resuming**: Supports checkpointing to save the state of the optimization process and resume from a specific point if needed. +- **Search Space Visualization**: Generates visual representations of the search space to aid in understanding the hyperparameter landscape. \ No newline at end of file diff --git a/docs/sections/Introduction/index.rst b/docs/sections/Introduction/index.rst new file mode 100644 index 0000000..38fa957 --- /dev/null +++ b/docs/sections/Introduction/index.rst @@ -0,0 +1,17 @@ +Introduction +================== + +`mloptimizer` is a Python library developed to enhance the performance of machine learning models through the optimization of hyperparameters using genetic algorithms. 
By applying principles inspired by natural selection, genetic algorithms allow `mloptimizer` to explore large hyperparameter spaces efficiently. This approach not only reduces search time and energy consumption but also achieves results comparable to more computationally demanding search methods. + +This User Guide provides a comprehensive overview of `mloptimizer`’s functionality, setup, and usage. It’s designed for users who are familiar with machine learning libraries like :mod:`scikit-learn` and are looking to incorporate more flexible optimization techniques into their workflows. + +The library’s syntax is intentionally similar to :class:`scikit-learn `'s hyperparameter search tools, but with added layers of customization and control. `mloptimizer` supports a range of popular algorithms, including :class:`DecisionTreeClassifier `, :class:`RandomForestClassifier `, and :class:`XGBClassifier `. Additionally, it is designed to be compatible with other models that follow the :class:`Estimator ` class from the :mod:`scikit-learn` API, allowing for easy integration into existing projects. + +This guide will walk you through everything from installation and quickstart examples to more advanced concepts, customization options, and visualization tools. While `mloptimizer` is designed to be user-friendly, it also offers advanced configuration options for users seeking fine-grained control over their optimization processes. The guide reflects the current functionality of `mloptimizer` and will be updated as the library evolves. + +.. toctree:: + :hidden: + + features + overview + diff --git a/docs/sections/Introduction/overview.rst b/docs/sections/Introduction/overview.rst new file mode 100644 index 0000000..d5e0e7c --- /dev/null +++ b/docs/sections/Introduction/overview.rst @@ -0,0 +1,132 @@ +Overview +========================= + +`mloptimizer` provides a flexible framework for optimizing machine learning models through hyperparameter tuning using genetic algorithms. The primary tools include the :class:`GeneticSearch ` optimizer and the :class:`HyperparameterSpaceBuilder `, which defines the hyperparameters to be tuned. This approach ensures efficient exploration of large parameter spaces, reducing the computational cost of hyperparameter search. + +The :class:`GeneticSearch ` class is compatible with any model that adheres to the :class:`Estimator ` API in :mod:`scikit-learn`, making integration into existing workflows straightforward. + +**Key Components**: + +- **GeneticSearch**: The optimization engine, utilizing genetic algorithms to efficiently search for optimal hyperparameter configurations. +- **HyperparameterSpaceBuilder**: A builder class for defining both fixed and evolvable hyperparameters, supporting a variety of parameter types. This class provides a streamlined, user-friendly approach to constructing hyperparameter spaces. + +Usage +----- + +To get started with `mloptimizer`, follow these main steps: + +1. **Define the Dataset**: Load or prepare your feature matrix (`X`) and target vector (`y`). +2. **Choose a Model**: Select a machine learning model that adheres to the :mod:`scikit-learn` API (e.g., :class:`XGBClassifier `). +3. **Set Up Hyperparameter Space**: Use :class:`HyperparameterSpaceBuilder ` to define the parameters you want to optimize, either by loading a default space or by adding custom hyperparameters. +4. **Run GeneticSearch**: Initialize :class:`GeneticSearch ` with the model and hyperparameter space, then call the `fit` method to start optimization. 
+ +**Using HyperparameterSpaceBuilder**: + +The :class:`HyperparameterSpaceBuilder ` allows for a clean, structured setup of both fixed and evolvable hyperparameters: + +- **Fixed Parameters**: Parameters that remain constant during the optimization process. +- **Evolvable Parameters**: Parameters that the genetic algorithm can adjust to find the best configuration. + +You can add integer, float, and categorical parameters, or load default spaces for commonly used models. + +Basic Usage Example +------------------- + +The following example demonstrates the basic usage of :class:`GeneticSearch ` with :class:`HyperparameterSpaceBuilder ` to optimize an :class:`XGBClassifier ` using default hyperparameters. + +.. code-block:: python + + from mloptimizer.interfaces import GeneticSearch, HyperparameterSpaceBuilder + from xgboost import XGBClassifier + from sklearn.datasets import load_iris + + # Load dataset + X, y = load_iris(return_X_y=True) + + # Define default hyperparameter space + hyperparam_space = HyperparameterSpaceBuilder.get_default_space(XGBClassifier) + + # Initialize GeneticSearch with default space + opt = GeneticSearch( + estimator_class=XGBClassifier, + hyperparam_space=hyperparam_space, + genetic_params_dict={"generations": 10, "population_size": 20} + ) + + # Run optimization + opt.fit(X, y) + print(opt.best_estimator_) + +This setup leverages the default hyperparameter space for :class:`XGBClassifier ` to start the optimization process immediately. + +Custom HyperparameterSpace Example +---------------------------------- + +For specific tuning needs, use :class:`HyperparameterSpaceBuilder ` to define a custom hyperparameter space with both fixed and evolvable parameters. Here’s an example: + +.. code-block:: python + + from mloptimizer.interfaces import GeneticSearch, HyperparameterSpaceBuilder + from xgboost import XGBClassifier + + # Initialize HyperparameterSpaceBuilder + builder = HyperparameterSpaceBuilder() + + # Add evolvable parameters + builder.add_int_param("n_estimators", 50, 300) + builder.add_float_param("learning_rate", 0.01, 0.3) + builder.add_categorical_param("booster", ["gbtree", "dart"]) + + # Set fixed parameters + builder.set_fixed_param("max_depth", 5) + + # Build the custom hyperparameter space + custom_hyperparam_space = builder.build() + + # Initialize GeneticSearch with custom space + opt = GeneticSearch( + estimator_class=XGBClassifier, + hyperparam_space=custom_hyperparam_space, + genetic_params_dict={"generations": 5, "population_size": 10} + ) + + # Run optimization + opt.fit(X, y) + +This example showcases adding custom integer, float, and categorical parameters, as well as fixed parameters to fine-tune the optimization process for :class:`XGBClassifier `. + +Reproducibility +--------------- + +For consistent results, you can set a `seed` in :class:`GeneticSearch `. This ensures that repeated runs yield identical results, which is essential for experimental reproducibility. + +Example: + +.. 
code-block:: python + + from mloptimizer.interfaces import GeneticSearch, HyperparameterSpaceBuilder + from xgboost import XGBClassifier + from sklearn.datasets import load_iris + + # Load dataset + X, y = load_iris(return_X_y=True) + + # Define default hyperparameter space + hyperparam_space = HyperparameterSpaceBuilder.get_default_space(XGBClassifier) + + # Initialize optimizer with a fixed seed + opt = GeneticSearch( + estimator_class=XGBClassifier, + hyperparam_space=hyperparam_space, + genetic_params_dict={"generations": 5, "population_size": 10}, + seed=42 + ) + + # Run optimization + opt.fit(X, y) + +Setting the same `seed` value across multiple runs will produce identical results, enabling reliable comparison between experiments. + +.. warning:: + + On macOS with newer processor architectures (e.g., M1 or M2 chips), users may experience occasional reproducibility issues due to hardware-related differences in random number generation and floating-point calculations. To ensure consistency across runs, we recommend running `mloptimizer` within a Docker container configured for reproducible behavior. This approach helps isolate the environment and improves reproducibility on macOS hardware. diff --git a/docs/sections/Quickstart/index.rst b/docs/sections/Quickstart/index.rst new file mode 100644 index 0000000..9966c6b --- /dev/null +++ b/docs/sections/Quickstart/index.rst @@ -0,0 +1,25 @@ +Quick Start +================== + +This Quick Start guide will introduce you to the core steps for hyperparameter optimization using `mloptimizer`. By leveraging `GeneticSearch` and `HyperparameterSpaceBuilder`, you’ll gain control over tuning your model’s performance with a streamlined approach similar to `GridSearchCV` from :mod:`scikit-learn`. + +.. toctree:: + step1 + step2 + step3 + step4 + +Overview of Steps +----------------- + +1. **Step 1: Setting Up an Optimization with GeneticSearch** + Begin by setting up `GeneticSearch` as your optimization engine. This step will show you how to initialize `GeneticSearch`, configure the genetic algorithm parameters, and use it seamlessly within machine learning pipelines, following a familiar approach to `GridSearchCV`. + +2. **Step 2: Defining Hyperparameter Spaces with HyperparameterSpaceBuilder** + Define your search space using `HyperparameterSpaceBuilder`. Learn how to create flexible, robust hyperparameter spaces with fixed and evolvable parameters, either through default setups or custom configurations tailored to your model’s needs. + +3. **Step 3: Running and Monitoring Optimization** + Execute and monitor the optimization process with `GeneticSearch`. This step guides you through running the optimization and tracking progress by observing key metrics, helping you understand the performance of each generation. + +4. **Step 4: Reviewing and Interpreting Results** + Finally, assess the outcomes of the optimization process. This step explains how to identify the best estimator, analyze key performance indicators, and interpret results with practical examples to make data-driven adjustments. 
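+
+Minimal End-to-End Example
+--------------------------
+
+If you want the whole workflow at a glance before working through the steps, the sketch below condenses them into one short script. It uses only the interfaces covered in the following pages (`GeneticSearch`, `HyperparameterSpaceBuilder.get_default_space`, `fit`, and the `best_estimator_` attribute shown in Step 4); see the individual steps for the available options and details.
+
+.. code-block:: python
+
+    from mloptimizer.interfaces import GeneticSearch, HyperparameterSpaceBuilder
+    from xgboost import XGBClassifier
+    from sklearn.datasets import load_iris
+
+    # Steps 1-2: data, estimator choice and a default hyperparameter space
+    X, y = load_iris(return_X_y=True)
+    hyperparam_space = HyperparameterSpaceBuilder.get_default_space(XGBClassifier)
+
+    # Step 3: run the genetic optimization
+    opt = GeneticSearch(
+        estimator_class=XGBClassifier,
+        hyperparam_space=hyperparam_space,
+        genetic_params_dict={"generations": 10, "population_size": 20},
+        seed=42,
+    )
+    opt.fit(X, y)
+
+    # Step 4: inspect the best configuration found
+    print(opt.best_estimator_)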
diff --git a/docs/sections/Quickstart/step1.rst b/docs/sections/Quickstart/step1.rst new file mode 100644 index 0000000..f2a16bd --- /dev/null +++ b/docs/sections/Quickstart/step1.rst @@ -0,0 +1,120 @@ +Setting Up an Optimization with GeneticSearch +===================================================== + +In this step, you’ll learn how to set up `GeneticSearch` as an optimizer for your machine learning model, using it similarly to `GridSearchCV` in :mod:`scikit-learn`. `GeneticSearch` is compatible with any model that adheres to the :class:`Estimator ` API, making it easy to integrate into pipelines. This guide covers initializing `GeneticSearch`, configuring key parameters, and using it to optimize model hyperparameters efficiently. + +Overview of GeneticSearch +------------------------- + +`GeneticSearch` is an optimization class based on genetic algorithms, a powerful search technique that reduces search time by iteratively refining solutions. By treating each set of hyperparameters as an “individual” in a population, `GeneticSearch` evolves the population over multiple generations to find the optimal configuration. This approach is particularly useful for large or complex search spaces where traditional grid or random search would be too computationally expensive. + +Configuring Genetic Parameters +------------------------------ + +The `genetic_params_dict` allows you to control how `GeneticSearch` performs the optimization. By default, the following parameters are set but can be customized to refine the genetic algorithm’s behavior. These parameters reference DEAP’s genetic algorithm operations and include: + +- **generations**: Number of evolutionary cycles the genetic algorithm will run. More generations allow for deeper refinement but increase computational time. +- **population_size**: Number of individuals in each generation. Larger populations explore the search space more thoroughly but require more computational resources. +- **cxpb** (crossover probability): The probability of crossover, where two individuals combine to create offspring. Higher values promote diversity in the population. +- **mutpb** (mutation probability): Probability of mutation, introducing random changes in individuals. Higher values increase diversity but may slow convergence. +- **n_elites**: Number of top-performing individuals to retain in each generation. This keeps the best-performing solutions in the population, aiding stability. +- **tournsize**: Tournament size, which controls the selection pressure. Larger values make selection more competitive, favoring fitter individuals. +- **indpb**: Independent probability of mutating each attribute. This fine-tunes the mutation process, with higher values leading to more exploratory changes. + +**Custom Genetic Parameters**: + +The example below demonstrates how to override the default `genetic_params_dict` with specific values for `generations`, `population_size`, `cxpb`, `mutpb`, and other parameters to control the optimization behavior. + +.. 
code-block:: python
+
+    # Initialize GeneticSearch with custom genetic parameters
+    opt = GeneticSearch(
+        estimator_class=XGBClassifier,
+        hyperparam_space=hyperparam_space,
+        genetic_params_dict={
+            "generations": 20,
+            "population_size": 30,
+            "cxpb": 0.7,
+            "mutpb": 0.4,
+            "n_elites": 5,
+            "tournsize": 4,
+            "indpb": 0.1
+        },
+        seed=42 # Set seed for reproducibility
+    )
+
+    print("GeneticSearch initialized with custom genetic parameters.")
+
+Basic Initialization
+--------------------
+
+Here’s how to initialize `GeneticSearch` in a simple example:
+
+1. **Load your data**: Start by loading or preparing your dataset, ensuring you have features (`X`) and labels (`y`).
+2. **Define the model and hyperparameter space**: Choose a model (e.g., :class:`XGBClassifier `) and set up a hyperparameter space with `HyperparameterSpaceBuilder`.
+3. **Initialize GeneticSearch**: Use the chosen model, hyperparameter space, and genetic algorithm parameters to set up `GeneticSearch`.
+
+**Example: Setting Up GeneticSearch with Default Parameters**
+
+This example demonstrates using `GeneticSearch` to optimize an `XGBClassifier` with default parameters.
+
+.. code-block:: python
+
+    from mloptimizer.interfaces import GeneticSearch, HyperparameterSpaceBuilder
+    from xgboost import XGBClassifier
+    from sklearn.datasets import load_iris
+
+    # 1) Load the dataset
+    X, y = load_iris(return_X_y=True)
+
+    # 2) Define the hyperparameter space (using default space for XGBClassifier)
+    hyperparam_space = HyperparameterSpaceBuilder.get_default_space(XGBClassifier)
+
+    # 3) Initialize GeneticSearch
+    opt = GeneticSearch(
+        estimator_class=XGBClassifier,
+        hyperparam_space=hyperparam_space,
+        genetic_params_dict={"generations": 10, "population_size": 20}
+    )
+
+    # Ready to run optimization in the next step
+    print("GeneticSearch initialized and ready for optimization.")
+
+Incorporating GeneticSearch into Pipelines
+------------------------------------------
+
+One of the benefits of `GeneticSearch` is that it can be treated similarly to `GridSearchCV`, enabling integration into `scikit-learn` pipelines. Here’s an example using `Pipeline` to chain data preprocessing and model optimization.
+
+.. code-block:: python
+
+    from sklearn.pipeline import Pipeline
+    from sklearn.preprocessing import StandardScaler
+
+    # Define a preprocessing and optimization pipeline
+    pipeline = Pipeline([
+        ("scaler", StandardScaler()), # Standardize features
+        ("genetic_search", GeneticSearch(
+            estimator_class=XGBClassifier,
+            hyperparam_space=hyperparam_space,
+            genetic_params_dict={"generations": 10, "population_size": 20},
+            seed=42
+        ))
+    ])
+
+    # Fit pipeline on the dataset
+    pipeline.fit(X, y)
+
+    print("Pipeline with GeneticSearch completed.")
+
+This example shows how to integrate `GeneticSearch` with other preprocessing steps in a pipeline, treating it as you would any other estimator in :mod:`scikit-learn`.
+
+Summary
+-------
+
+In this step, you learned to:
+
+1. Initialize `GeneticSearch` with a compatible model and hyperparameter space.
+2. Configure essential genetic algorithm parameters to control the search process.
+3. Incorporate `GeneticSearch` into a machine learning pipeline for seamless optimization.
+
+Once `GeneticSearch` is set up, you’re ready to define your hyperparameter space in Step 2, fine-tuning the search space to suit your model’s needs.
diff --git a/docs/sections/Quickstart/step2.rst b/docs/sections/Quickstart/step2.rst new file mode 100644 index 0000000..b5e5a17 --- /dev/null +++ b/docs/sections/Quickstart/step2.rst @@ -0,0 +1,134 @@ +Defining Hyperparameter Spaces with HyperparameterSpaceBuilder +====================================================================== + +In this step, we will explore how to define hyperparameter spaces for optimization using :class:`HyperparameterSpaceBuilder `. This builder allows you to create a comprehensive set of hyperparameters, including both fixed values and evolvable parameters, to maximize the flexibility and effectiveness of your optimization with :class:`GeneticSearch `. + +Overview of HyperparameterSpaceBuilder +-------------------------------------- + +`HyperparameterSpaceBuilder` is a dedicated class for constructing hyperparameter spaces. It supports various data types, such as integers, floats, and categorical parameters, making it adaptable to a wide range of machine learning models. You can use it to define default spaces or create custom configurations tailored to your specific needs. + +Key Methods +----------- + +- **add_int_param(name, min_value, max_value)**: Adds an integer hyperparameter with a defined range to the space. +- **add_float_param(name, min_value, max_value, scale=100)**: Adds a float hyperparameter with a defined range and optional scaling. +- **add_categorical_param(name, values)**: Adds a categorical hyperparameter with a set of predefined values. +- **set_fixed_param(name, value)**: Sets a fixed parameter that will remain constant during optimization. +- **build()**: Builds and returns the completed hyperparameter space. +- **get_default_space(estimator_class)**: Loads a predefined hyperparameter space for the specified estimator. + +Using Default Hyperparameter Spaces +----------------------------------- + +`mloptimizer` provides predefined hyperparameter spaces for several common estimators, which can be loaded directly using the :meth:`get_default_space ` method. This is particularly useful for rapid experimentation, as these default configurations are tuned for compatibility with various algorithms. + +Supported estimators and their default hyperparameter spaces include: + +- **Decision Tree Models**: + - :class:`DecisionTreeClassifier ` + - :class:`DecisionTreeRegressor ` +- **Random Forest Models**: + - :class:`RandomForestClassifier ` + - :class:`RandomForestRegressor ` +- **Extra Trees Models**: + - :class:`ExtraTreesClassifier ` + - :class:`ExtraTreesRegressor ` +- **Gradient Boosting Models**: + - :class:`GradientBoostingClassifier ` + - :class:`GradientBoostingRegressor ` +- **XGBoost Models**: + - :class:`XGBClassifier ` + - :class:`XGBRegressor ` +- **Support Vector Machines**: + - :class:`SVC ` + - :class:`SVR ` + +These default hyperparameter spaces are available in JSON files within the repository, and `HyperparameterSpaceBuilder` can load them by referencing the relevant estimator class. + +Example: **Loading a Default Hyperparameter Space**: + +The example below demonstrates how to load a default hyperparameter space for :class:`XGBClassifier ` using :meth:`get_default_space `. + +.. 
code-block:: python + + from mloptimizer.interfaces import HyperparameterSpaceBuilder + from xgboost import XGBClassifier + + # Load the default hyperparameter space for XGBClassifier + hyperparam_space = HyperparameterSpaceBuilder.get_default_space(XGBClassifier) + + print("Default hyperparameter space loaded for XGBClassifier.") + +Creating a Custom Hyperparameter Space +-------------------------------------- + +In many cases, you may want to define a custom hyperparameter space with specific parameters tailored to your model. :class:`HyperparameterSpaceBuilder ` allows you to add parameters that :class:`GeneticSearch ` can evolve during optimization, as well as parameters with fixed values that will not change. + +Example: **Adding Evolvable Parameters**: + +The example below demonstrates how to use `HyperparameterSpaceBuilder` to add evolvable parameters (parameters that will be optimized) for a custom `XGBClassifier` configuration. + +.. code-block:: python + + from mloptimizer.interfaces import HyperparameterSpaceBuilder + + # Initialize the builder + builder = HyperparameterSpaceBuilder() + + # Add evolvable hyperparameters + builder.add_int_param("n_estimators", 50, 300) + builder.add_float_param("learning_rate", 0.01, 0.3) + builder.add_categorical_param("booster", ["gbtree", "dart"]) + + # Build the hyperparameter space + custom_hyperparam_space = builder.build() + + print("Custom evolvable hyperparameter space created.") + +Example: **Adding Fixed Parameters**: + +You can also set fixed parameters that will remain constant during the optimization process. Here’s an example of setting both evolvable and fixed parameters. + +.. code-block:: python + + # Set a fixed hyperparameter + builder.set_fixed_param("max_depth", 5) + + # Add evolvable parameters + builder.add_int_param("n_estimators", 100, 500) + builder.add_float_param("subsample", 0.5, 1.0) + + # Build the custom hyperparameter space + mixed_hyperparam_space = builder.build() + + print("Hyperparameter space with both fixed and evolvable parameters created.") + +Saving and Loading Hyperparameter Spaces +---------------------------------------- + +:class:`HyperparameterSpaceBuilder ` also provides functionality to save and load hyperparameter spaces, allowing you to reuse configurations across projects or experiments. + +Example: **Saving and Loading a Hyperparameter Space**: + +.. code-block:: python + + # Save the custom hyperparameter space to a file + builder.save_space(mixed_hyperparam_space, "custom_hyperparam_space.json", overwrite=True) + + # Load the saved hyperparameter space + loaded_hyperparam_space = builder.get_default_space(XGBClassifier) + + print("Hyperparameter space saved and reloaded.") + +Summary +------- + +In this step, we covered: + +1. Defining hyperparameter spaces using :class:`HyperparameterSpaceBuilder `. +2. Creating both fixed and evolvable parameters for flexible optimization. +3. Loading default hyperparameter spaces for supported models. +4. Saving and loading hyperparameter spaces for easy reuse. + +Once your hyperparameter space is defined, you’re ready to move on to Step 3, where we’ll execute and monitor the optimization process. 
diff --git a/docs/sections/Quickstart/step3.rst b/docs/sections/Quickstart/step3.rst new file mode 100644 index 0000000..591e422 --- /dev/null +++ b/docs/sections/Quickstart/step3.rst @@ -0,0 +1,105 @@ +Running and Monitoring Optimization +=========================================== + +In this step, we cover how to execute the optimization process with :class:`GeneticSearch ` and monitor key metrics. The progress output from `GeneticSearch` provides real-time feedback, allowing you to track the optimization’s performance across generations and assess convergence. + +Executing the Optimization +-------------------------- + +Once you have defined your model and hyperparameter space with :class:`HyperparameterSpaceBuilder `, you’re ready to execute the optimization with `GeneticSearch`. + +To start the optimization, call the `fit` method on your `GeneticSearch` instance, passing in the feature matrix (`X`) and target vector (`y`). This initiates the genetic algorithm, which runs the optimization over the specified number of generations and population size, iteratively refining hyperparameters. + +**Example: Running GeneticSearch** + +The example below demonstrates running `GeneticSearch` with a basic configuration: + +.. code-block:: python + + from mloptimizer.interfaces import GeneticSearch, HyperparameterSpaceBuilder + from xgboost import XGBClassifier + from sklearn.datasets import load_iris + + # Load the dataset + X, y = load_iris(return_X_y=True) + + # Define the hyperparameter space using the default space for XGBClassifier + hyperparam_space = HyperparameterSpaceBuilder.get_default_space(XGBClassifier) + + # Initialize GeneticSearch + opt = GeneticSearch( + estimator_class=XGBClassifier, + hyperparam_space=hyperparam_space, + genetic_params_dict={"generations": 10, "population_size": 20}, + seed=42 + ) + + # Execute the optimization + opt.fit(X, y) + + print("Optimization completed. Best estimator found.") + +Progress Monitoring Output +-------------------------- + +During the optimization process, `GeneticSearch` provides real-time feedback on the console, updating progress with each generation. The typical output includes information about: + +- **Progress Percentage**: Displays the percentage of generations completed out of the total specified. +- **Best Fitness**: Shows the highest score (fitness) achieved so far, reflecting the performance of the best hyperparameter set found by the algorithm. +- **Generation Speed**: Indicates the processing rate (e.g., iterations per second) to give an estimate of runtime. + +**Example Output** + +Here’s a sample output from `GeneticSearch` while running the `fit()` method: + +.. code-block:: text + + WARNING:root:The folder . already exists and it will be used + INFO:mloptimizer.log:Initiating genetic optimization... + INFO:mloptimizer.log:Algorithm: Optimizer + + Genetic execution: 0%| | 0/31 [00:00`, it’s essential to interpret the outcomes to understand the best-performing model, its hyperparameters, and the optimization’s overall effectiveness. This step will guide you through accessing the best model, understanding cross-validation vs. full-dataset training, interpreting key metrics, and visualizing the optimization process. + +Accessing the Best Model and Parameters +---------------------------------------- + +Once optimization completes, `GeneticSearch` provides access to the best model and hyperparameters found. The best model is stored as `opt.best_estimator_`, containing the optimal hyperparameters identified by the algorithm. 
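+
+The prediction and scoring examples later in this step assume a held-out test set (`X_test`, `y_test`) that was not seen during the optimization; the variable names are illustrative. One way to create such a split beforehand, shown here as a minimal sketch using standard scikit-learn utilities, is:
+
+.. code-block:: python
+
+    from sklearn.datasets import load_iris
+    from sklearn.model_selection import train_test_split
+
+    # Hold out a test set before optimizing; run opt.fit(X_train, y_train)
+    # so the held-out data stays untouched for the final evaluation.
+    X, y = load_iris(return_X_y=True)
+    X_train, X_test, y_train, y_test = train_test_split(
+        X, y, test_size=0.2, random_state=42
+    )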
+
+**Note**: The model in `best_estimator_` (and the one returned by `fit()`) is retrained on the entire dataset (`X` and `y`) provided to the optimization, regardless of the cross-validation method used during the search. During optimization, cross-validation is used only to assess candidate models’ fitness scores; the final model is fully trained to ensure the best performance on all available data.
+
+**Example: Accessing and Displaying the Best Model**
+
+.. code-block:: python
+
+    # Display the best estimator found during optimization
+    print("Best Estimator:")
+    print(opt.best_estimator_)
+
+The output will show the details of the best model, including its class and the optimized hyperparameters:
+
+.. code-block:: text
+
+    DecisionTreeClassifier(ccp_alpha=0.00055, max_depth=4,
+                           min_impurity_decrease=0.001, min_samples_split=5,
+                           random_state=296596)
+
+**Example: Accessing Best Hyperparameters and Fitness Score**
+
+The best hyperparameters and fitness score can be accessed directly through `best_params_` and `best_fitness_` attributes.
+
+.. code-block:: python
+
+    # Display the best hyperparameters and fitness score
+    print("Best Hyperparameters:", opt.best_params_)
+    print("Best Fitness Score:", opt.best_fitness_)
+
+These attributes provide a summary of the best configuration found and its performance score, allowing you to assess the quality of the optimized model.
+
+**Using the Best Model for Prediction or Scoring**
+
+You can now use the `best_estimator_` model to make predictions or evaluate its performance on new data. For example:
+
+.. code-block:: python
+
+    # Using the best model to make predictions
+    y_pred = opt.best_estimator_.predict(X_test)
+
+    # Evaluating its performance on a test set
+    score = opt.best_estimator_.score(X_test, y_test)
+    print("Test Set Score:", score)
+
+Key Considerations for Result Interpretation
+--------------------------------------------
+
+When reviewing results, consider the following:
+
+- **Convergence Behavior**: If the evolution graph shows that fitness scores stabilized, the algorithm likely found an optimal solution. If scores continued improving, additional generations may yield further gains.
+- **Parameter Sensitivity**: The search space graph can help identify parameters that had a strong impact on fitness. Hyperparameters with a narrower range near high fitness scores are likely more sensitive.
+- **Validation**: For a comprehensive performance assessment, evaluate the `best_estimator_` on a separate validation or test set if available. This provides an unbiased measure of its effectiveness on new data.
+- **Generalizability**: If you plan to use this model for similar tasks, the best hyperparameters identified can serve as a strong starting point for future optimizations.
+
+Visualizing Optimization Results
+--------------------------------
+
+To gain insights into the optimization process, you can visualize the fitness evolution over generations and the search space explored by the genetic algorithm. `mloptimizer` includes built-in functions to generate these plots.
+
+**Evolution (Logbook) Graph**
+
+The evolution graph displays the fitness function’s progress across generations, showing the maximum, minimum, and average fitness values for each generation. This visualization helps you understand the convergence pattern and whether the optimization reached a stable solution.
+
+**Example: Generating the Evolution Graph**
+
+.. code-block:: python
+
+    from mloptimizer.application.reporting.plots import plotly_logbook
+    import plotly.io as pio
+
+    # Plot the evolution graph
+    population_df = opt.populations_
+    evolution_graph = plotly_logbook(opt.logbook_, population_df)
+    pio.show(evolution_graph)
+
+In this graph:
+
+- **Black lines** represent the max and min fitness values across generations.
+- **Green, red, and blue lines** correspond to the max, min, and average fitness values per generation.
+- **Gray points** indicate individual fitness values within each generation, providing a sense of population diversity.
+
+At the end of the optimization, the evolution graph is saved as an HTML file for easy reference. For the location of the saved plot, refer to the results folder’s structure in the documentation: :doc:`../Results/directory_structure`.
+
+**Search Space Graph**
+
+The search space graph visualizes the hyperparameter values explored by the genetic algorithm. This plot shows the range of values tested for each hyperparameter and highlights the fitness scores associated with each combination, providing insight into the hyperparameter landscape.
+
+**Example: Generating the Search Space Graph**
+
+.. code-block:: python
+
+    from mloptimizer.application.reporting.plots import plotly_search_space
+
+    # Get population data and relevant parameters
+    population_df = opt.populations_
+    param_names = list(opt.get_evolvable_hyperparams().keys())
+    param_names.append("fitness")
+
+    # Create the search space plot
+    search_space_graph = plotly_search_space(population_df[param_names], param_names)
+    pio.show(search_space_graph)
+
+In the search space graph:
+
+- Each point represents a unique hyperparameter configuration tested by the genetic algorithm.
+- The distribution of points shows the explored search space, helping you identify which hyperparameter ranges yielded higher fitness scores.
+
+Results and Directory Structure
+-------------------------------
+
+After optimization completes, `GeneticSearch` generates a results folder containing detailed information about the best model and other optimization data. This folder includes:
+
+- **Best Model Details**: Information on the best-performing model and its hyperparameters.
+- **Evolution Log**: Data on fitness scores and hyperparameter values for each generation.
+- **Saved Visualizations**: HTML files for the evolution and search space graphs.
+
+For more details on the results folder structure, refer to the documentation: :doc:`../Results/directory_structure`.
+
+Summary
+-------
+
+In this final step, we covered:
+
+1. Accessing the best model and interpreting its hyperparameters and fitness score.
+2. Using the best model for predictions or scoring on test data.
+3. Visualizing the optimization process using evolution and search space graphs.
+4. Understanding and interpreting optimization trends and parameter sensitivity.
+
+This concludes the Quick Start guide. You’re now equipped to optimize hyperparameters using `GeneticSearch` and interpret the outcomes effectively, enabling you to fine-tune models for improved performance on your tasks.
diff --git a/docs/sections/Results/directory_structure.rst b/docs/sections/Results/directory_structure.rst new file mode 100644 index 0000000..5550e50 --- /dev/null +++ b/docs/sections/Results/directory_structure.rst @@ -0,0 +1,49 @@ +============================= +Optimizer Directory Structure +============================= + +When an optimizer is run, it generates a directory within the current working directory (or the specified output directory) to store all optimization results. This directory, named in the format `YYYYMMDD_nnnnnnnnnn_OptimizerName`, organizes the outputs from the optimization process, including the best estimator, checkpoints, visualizations, logs, and progress data. + +Directory Structure +------------------- + +The directory structure is organized as follows: + +.. code-block:: bash + + ├── checkpoints + │ ├── cp_gen_0.pkl + │ └── cp_gen_1.pkl + ├── graphics + │ ├── logbook.html + │ └── search_space.html + ├── opt.log + ├── progress + │ ├── Generation_0.csv + │ └── Generation_1.csv + └── results + ├── logbook.csv + └── populations.csv + +Directory Contents +------------------ + +Each directory and file serves a specific purpose: + +- **checkpoints**: Contains serialized checkpoint files for each generation. These checkpoints save the optimizer's state, allowing you to resume the process from a specific generation if necessary. + + - `cp_gen_0.pkl`, `cp_gen_1.pkl`: Checkpoints for each generation, named by generation number, saved in Python's pickle format. + +- **graphics**: Stores HTML visualizations of the optimization process. + + - `logbook.html`: An interactive logbook visualization showing optimization statistics and trends over generations. + - `search_space.html`: A visualization of the search space, showing how hyperparameters were explored during optimization. + +- **opt.log**: The primary log file containing detailed information about the optimization process, including metrics and outcomes for each generation. + +- **progress**: Stores CSV files with detailed information about each generation's progress. + + - `Generation_0.csv`, `Generation_1.csv`: These files contain records of each individual in the population for each generation, including hyperparameters and fitness scores. + +- **results**: Contains summary CSV files of the optimization results. + + - `logbook.csv`: A CSV version of the logbook, recording generation-by-generation statistics of the optimization process. + - `populations.csv`: Final population data, including hyperparameters and fitness values of each individual in the last generation. + +Each of these directories and files is structured to help you analyze and interpret the optimization process in detail, from individual generations to final results and visualizations. diff --git a/docs/sections/Results/evolution_graph.rst b/docs/sections/Results/evolution_graph.rst new file mode 100644 index 0000000..f1749e0 --- /dev/null +++ b/docs/sections/Results/evolution_graph.rst @@ -0,0 +1,41 @@ +Evolution Graph +=============== + +The evolution graph visualizes the progression of fitness scores across generations in a genetic optimization process. This plot helps you understand the algorithm’s convergence behavior, track improvements in fitness scores, and observe the distribution of individual scores within each generation. + +Gallery Example +--------------- +See the following example for practical usage and code details: + +.. 
list-table:: + :widths: 25 75 + :header-rows: 1 + + * - Example + - Description + * - :ref:`sphx_glr_auto_examples_plot_evolution.py` + - Demonstrates how to set up and plot the evolution graph of a genetic algorithm optimization process. + + +Overview +-------- + +The evolution graph highlights key metrics throughout the optimization process: + +- **Max and Min Fitness Lines**: Black lines showing the overall max and min fitness values across generations. +- **Generation-Based Metrics**: Green, red, and blue lines for max, min, and average fitness within each generation. +- **Individual Fitness Scores**: Gray points representing fitness scores of each individual, illustrating the population’s diversity and convergence at each stage. + +The `mloptimizer` library provides a function to generate this graph using Plotly, making it interactive and customizable. + +Saved Graph and Data Files +-------------------------- + +After the optimization completes, the evolution graph and related data are saved for future reference: + +- **Graph Path**: An HTML file of the evolution graph is saved in the `graphics` directory. +- **Data Path**: CSV files with population data and logbook statistics are saved in the `results` directory. + +For a detailed directory layout, refer to :doc:`directory_structure`. + +**Note**: The evolution graph helps identify whether the genetic algorithm has converged or if additional generations might improve fitness further. diff --git a/docs/sections/Results/index.rst b/docs/sections/Results/index.rst new file mode 100644 index 0000000..01d4047 --- /dev/null +++ b/docs/sections/Results/index.rst @@ -0,0 +1,21 @@ +Analyzing and Visualizing Results +================================= + +This section provides tools and guidance to assess optimization success, extract insights, and further refine your model. Learn how to explore output files, visualize optimization progress, and interpret search space results. + +.. toctree:: + :hidden: + + directory_structure + evolution_graph + search_space_graph + + +Overview of Contents +-------------------- + +- **Explore Directory Structure**: Review and understand the organization of output files, including checkpoints, logs, and result summaries, to locate key information easily. + +- **Generate an Evolution Graph**: Visualize how fitness scores evolved across generations to assess convergence and track optimization progress. + +- **Map the Search Space**: Examine the hyperparameter search space to see which configurations yielded the highest fitness scores, providing insights into parameter sensitivity and tuning effectiveness. diff --git a/docs/sections/Results/search_space_graph.rst b/docs/sections/Results/search_space_graph.rst new file mode 100644 index 0000000..9e9ccfd --- /dev/null +++ b/docs/sections/Results/search_space_graph.rst @@ -0,0 +1,51 @@ +Search Space Graph +================== + +The search space graph provides a visualization of the hyperparameter values explored during the genetic optimization process. This plot reveals the distribution of evaluated hyperparameters and helps you identify the value ranges that achieved higher fitness scores, offering valuable insights into the search space landscape and parameter sensitivity. + +Gallery Example +--------------- + +Refer to the following example for practical usage and code details: + +.. 
list-table:: + :widths: 25 75 + :header-rows: 1 + + * - Example + - Description + * - :ref:`sphx_glr_auto_examples_plot_search_space.py` + - Demonstrates how to set up and plot the search space graph for a genetic algorithm optimization process. + + +Understanding the Search Space Graph +------------------------------------ + +The search space graph is a scatter plot that visually represents the hyperparameter configurations tested by the genetic algorithm: + +- **Axes**: Each axis represents an evolvable hyperparameter. This allows you to observe the ranges of values explored during the optimization process. +- **Points**: Each point in the graph corresponds to a unique combination of hyperparameters, with its position representing specific values for each parameter. +- **Fitness Scores**: Points are typically color-coded or sized based on fitness scores, highlighting which parameter combinations yielded the best results. + +This visualization helps you identify hyperparameter ranges associated with higher fitness scores, providing insights into parameter sensitivity and guiding future optimizations. + +How to Use This Graph +--------------------- + +The search space graph is especially useful for assessing: + +- **Parameter Sensitivity**: By observing clusters of high-fitness points, you can identify hyperparameters that significantly impact model performance. +- **Value Ranges for Further Tuning**: If certain parameter ranges are associated with better fitness, you can refine future optimization runs to focus on those areas. +- **Relationships Between Parameters**: The graph can reveal interactions between hyperparameters, such as values that consistently lead to higher fitness when used in combination. + +Saved Graph and Data Files +-------------------------- + +After the optimization completes, the search space graph and related data are saved for future reference: + +- **Graph Path**: An HTML file of the search space graph is saved in the `graphics` directory. +- **Data Path**: CSV files with population data, including hyperparameter values and fitness scores, are saved in the `results` directory. + +For a detailed directory layout, refer to :doc:`directory_structure`. + +**Takeaway**: The search space graph provides valuable insights into hyperparameter effectiveness and sensitivity, allowing you to identify which parameters significantly impact performance and explore promising value ranges for further tuning. diff --git a/docs/sections/introduction.rst b/docs/sections/introduction.rst deleted file mode 100644 index 1f6595a..0000000 --- a/docs/sections/introduction.rst +++ /dev/null @@ -1,103 +0,0 @@ -============ -Introduction -============ -This user guide is an introduction to the mloptimizer library, -designed to optimize machine learning models with a focus on ease of use of the Deap library. -The guide will demonstrate the library's capabilities through examples and highlight its features and customization options. - -mloptimizer is intended to complement detailed API documentation, offering practical insights and optimal usage strategies. - -While mloptimizer integrates seamlessly with Python's machine learning ecosystem, -it's built on Deap optimization algorithms, which are not specific to machine learning. -This guide primarily uses Python examples, providing a -straightforward path for practitioners familiar with Python-based machine learning libraries. 
- -Features --------- -The goal of mloptimizer is to provide a user-friendly, yet powerful optimization tool that: - -- Easy to use -- DEAP-based genetic algorithm ready to use with several machine learning algorithms -- Compatible with any machine learning algorithm that complies with the Scikit-Learn API -- Default hyperparameter spaces for the most common machine learning algorithms -- Default score functions for evaluating the performance of the model -- Reproducibility of results -- Extensible with more machine learning algorithms that comply with the Scikit-Learn API -- Customizable hyperparameter ranges -- Customizable score functions - - -Using mloptimizer ------------------ - -Step 1: Select and Setup the Algorithm to Optimize -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -mloptimizer uses a wrapper, `SklearnOptimizer`, for the algorithm for classification or regression that is going to be optimized. -It currently have default hyperparameter spaces the following algorithms: - -- `DecisionTreeClassifier`: Decision Tree Classifier from scikit-learn -- `RandomForestClassifier`: Random Forest Classifier from scikit-learn -- `ExtraTreesClassifier`: Extra Trees Classifier from scikit-learn -- `GradientBoostingClassifier`: Gradient Boosting Classifier from scikit-learn -- `SVC`: Support Vector Classifier from scikit-learn -- `KerasClassifier`: Custom Keras Classifier class -- `XGBClassifier`: XGBoost Classifier - -Let’s assume that we want to fine-tune the decision tree classifier from scikit-learn, wrapped in `SklearnOptimizer`. - -To instantiate the wrapper, you need to specify the class of the machine learning algorithm, -the dataset to work with, the hyperparameter space (fixed and evolvable), the input features (as a matrix), -and the output features (as a column). - -The wrapper has a variable with the set of hyperparameters to be explored. -For the case of the decision tree classifier in `DecisionTreeClassifier` from `sklearn.tree` -the default hyperparameters and their exploration ranges are: - -- `min_samples_split`, range [2, 50] -- `min_samples_leaf`, range [1, 20] -- `max_depth`, range [2, 20] -- `min_impurity_decrease`, range [0, 0.15] in 1000 steps -- `ccp_alpha`, range [0, 0.003] in 100,000 steps - -For a quick start, we will explore the default hyperparameters using a default range for exploring each of them. - -Similarly, in the wrapper, you can set up the metric to be optimized -(the parameter is called `score_function` and the default value is accuracy) -from the metrics available in scikit-learn (`sklearn.metrics`) -and the evaluation setting (the parameter is called `model_evaluation` -and the default value is the `train_score`). - -See the API reference for more details on setting up the wrapper and the optimization. - -Step 2: Running the Optimization -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Once you have instantiated the wrapper with the algorithm to optimize, you can run the genetic optimization. - -Typically, you should set the number of generations (3 by default) and the size of the population (10 by default). - -The optimization returns the best classifier found during the genetic optimization, -tuned with the corresponding hyperparameters. Additionally, during the optimization, a structure of -directories is created to store the results of the optimization process. The structure of the directories is -explained in the section on the optimizer output directory structure and contain useful information, logs, -checkpoints and plots. 
- -Step 3: Using the Outcome of the Optimization Process -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The result of the optimization process is the optimal classifier object. -You can use this object to make predictions on your dataset. -For example, if `clf_result` is the returned classifier, you can use `clf_result.predict(X)` to make predictions. - -In addition to the optimal classifier, -you can explore the outcomes of the optimization process, -such as the evolution of the population and the best score at each generation. -These outcomes are stored in the directory created by the optimizer, -as explained in the section on the optimizer output directory structure. - -.. warning:: - mloptimizer is not a machine learning library. It is a hyperparameter optimization library that can be used with any machine learning library that complies with the scikit-learn API. - -.. warning:: - Before optimizing a machine learning model using mloptimizer it is recommended first to have a cleaned dataset. mloptimizer does not provide any data preprocessing or cleaning tools. - -.. note:: - The examples in this guide are aligned with the latest version of mloptimizer. Users are encouraged to ensure they are using the most recent release to fully leverage the library's capabilities. diff --git a/docs/sections/user_guide.rst b/docs/sections/user_guide.rst index 0d6f13b..4b39a32 100644 --- a/docs/sections/user_guide.rst +++ b/docs/sections/user_guide.rst @@ -8,6 +8,7 @@ User Guide :maxdepth: 3 :numbered: - introduction - Basics/index - Concepts/index \ No newline at end of file + Introduction/index + Quickstart/index + Results/index + Advanced/index \ No newline at end of file