
To get started with ensemble methods, clone the repository and follow the examples provided in the examples directory. Ensure you have the necessary dependencies installed, which can be done using pip install -r requirements.txt.

Ensemble methods

An ensemble refers to a group of models working together. Ensemble methods are techniques for building a hybrid model by combining multiple models. The intuition behind this approach is to capitalize on each model's strengths while mitigating its weaknesses. Because each model captures a slightly different aspect of the data, the ensemble is more robust, and ensemble methods often perform better than individual models. The results from all the models in the ensemble are aggregated to form the final result: in classification tasks the class with the most votes is predicted as the final result, whereas in regression tasks the average of all the results is used.

Note : Models that are used to build an ensemble (the strong classifier) are referred to as base models or weak classifiers.

Types of Ensemble methods are

     1. Bagging

     2. Boosting

     3. Stacking

     4. Voting

1. Bagging : Also referred to as bootstrap aggregating. In this ensemble approach, multiple instances of the same base model are trained on different subsets of the training data. This method aims to capture various patterns in the data by creating diverse training sets. Each subset is drawn randomly with replacement, a process known as bootstrap sampling, and each subset has the same number of samples as the original dataset. The models are trained independently on their respective subsets. A well-known example of this technique is Random Forest, which combines multiple decision trees to improve overall performance.
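
As an illustration, here is a minimal bagging sketch using scikit-learn's BaggingClassifier with decision trees as the base model. The synthetic dataset and all parameter values are arbitrary placeholders, not recommendations.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    # Toy dataset as a stand-in for real data
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Each tree is trained on a bootstrap sample the same size as the training set
    bagging = BaggingClassifier(
        estimator=DecisionTreeClassifier(),  # base model (called base_estimator in older scikit-learn releases)
        n_estimators=50,                     # number of bootstrap samples / models
        bootstrap=True,                      # sample with replacement
        random_state=42,
    )
    bagging.fit(X_train, y_train)
    print("Bagging accuracy:", bagging.score(X_test, y_test))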

2. Boosting : In this ensemble approach, models are trained in sequence, where each subsequent model focuses on the errors made by the previous models. The intuition behind this approach is to learn from the mistakes.
Popular boosting methods are
1. AdaBoost : Short for Adaptive Boosting. In this approach, classifiers are trained sequentially, giving more weight to data points that were misclassified in previous rounds.
Training process :
Step 1. At the start of training, every sample is given the same importance. If there are N samples, each one is assigned a weight of 1/N. The training data looks like the following:

    Features  | Target   | Weights
    feature A | target X | 1/N
    feature B | target Y | 1/N
    feature C | target X | 1/N
    feature D | target Z | 1/N
Step 2. A weak classifier is trained on the weighted training data. After training, we calculate its weighted error ε by evaluating the same weak classifier on the same training data, and from ε we compute the classifier's weight α (in the classic formulation, α = 1/2 · ln((1 − ε) / ε)). We use α to increase the weights of the misclassified examples and decrease the weights of the correctly classified ones, after which the weights are re-normalized. The updated training data looks like the following:
    Features  | Target   | Old Weights | Correctly predicted | New Weights
    feature A | target X | 1/N         | Yes                 | (1/N) × exp(−α)
    feature B | target Y | 1/N         | No                  | (1/N) × exp(α)
    feature C | target X | 1/N         | No                  | (1/N) × exp(α)
    feature D | target Z | 1/N         | Yes                 | (1/N) × exp(−α)
Step 3. The second weak classifier is trained on the weight-adjusted data, then evaluated, and the sample weights are updated accordingly. This process is repeated until the desired number of weak classifiers has been trained.
→ AdaBoost improves accuracy by focusing on mistakes. The loss function is designed to pay more attention to the samples with higher weights.
→ The scikit-learn version of AdaBoost offers two boosting algorithms, SAMME (the original version) and SAMME.R (an updated version that uses class probabilities). SAMME.R tends to perform better than SAMME.
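For illustration, here is a minimal AdaBoost sketch with scikit-learn, using decision stumps (depth-1 trees) as the weak classifiers. The synthetic data and parameter values are placeholders.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Decision stumps are the classic weak classifiers for AdaBoost
    ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # base_estimator in older scikit-learn releases
        n_estimators=100,     # number of sequentially trained weak classifiers
        learning_rate=1.0,    # shrinks each classifier's contribution (its alpha)
        random_state=0,
    )
    ada.fit(X, y)
    print("Training accuracy:", ada.score(X, y))
    # In older scikit-learn versions, algorithm="SAMME" or "SAMME.R" can be
    # passed to choose between the two boosting variants mentioned above.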
2. Gradient Boosting Machines (GBM) : In GBM, models are trained in sequence, with each subsequent model fitting the residuals of the current ensemble to improve its accuracy.
Training process :
Step 1. Initially, a weak learner is trained on the entire training data and then evaluated to calculate the residuals.
For Regression (House Price Prediction):

    Median Income | Number of Bedrooms | Target Price | Predicted Price | Residual
    $60,000       | 3                  | $350,000     | $340,000        | $10,000
    $75,000       | 4                  | $450,000     | $460,000        | -$10,000
    $50,000       | 2                  | $250,000     | $255,000        | -$5,000
    $80,000       | 5                  | $500,000     | $490,000        | $10,000
For Classification (binary classification with a 0.5 threshold):

    Feature 1 | Feature 2 | True Label | Predicted Probability | Residual
    321       | 23        | 1          | 0.75                  | 0.25
    512       | 21        | 0          | 0.65                  | -0.65
    599       | 312       | 1          | 0.85                  | 0.15
    621       | 311       | 0          | 0.70                  | -0.70
Step 2. The next model is trained to predict these residuals. The training data for the subsequent model looks like the following.
For Regression (House Price Prediction):

    Median Income | Number of Bedrooms | Residual
    $60,000       | 3                  | $10,000
    $75,000       | 4                  | -$10,000
    $50,000       | 2                  | -$5,000
    $80,000       | 5                  | $10,000
For Classification (binary classification with a 0.5 threshold):

    Feature 1 | Feature 2 | Residual
    321       | 23        | 0.25
    512       | 21        | -0.65
    599       | 312       | 0.15
    621       | 311       | -0.70
Step 3. After training the second model, calculate the residuals of the entire ensemble (the first and second weak learners combined) and use these residuals to train the next model. Repeat this process until the desired number of weak learners has been trained.
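As an illustration of steps 1–3 for regression with squared-error loss, here is a minimal hand-rolled sketch in which each new tree is fit to the current residuals. The synthetic data, tree depth, and learning rate are arbitrary placeholders.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

    learning_rate = 0.1
    n_rounds = 50
    trees = []

    # Step 1: start from a simple initial prediction (the mean of the targets)
    prediction = np.full_like(y, y.mean(), dtype=float)

    for _ in range(n_rounds):
        residuals = y - prediction                      # errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)                          # Step 2: fit the next model to the residuals
        prediction += learning_rate * tree.predict(X)   # Step 3: update the ensemble's prediction
        trees.append(tree)

    print("Final training MSE:", np.mean((y - prediction) ** 2))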
Advanced Gradient Boosting Techniques :
1. XGBoost (Extreme Gradient Boosting): This is an optimized version of GBM with additional features for better performance and efficiency (a minimal usage sketch appears after this list).
Features :
Regularization: Adds L1 and L2 regularization to reduce overfitting.
Tree Pruning: Grows trees to a maximum depth and then prunes splits backward when their gain falls below a threshold.
Missing Values Handling: Learns a default direction for missing values at each split, so missing data is handled natively.
Parallel Processing: Parallelizes split finding across CPU cores, with optional GPU acceleration.
Histogram-Based Splitting: Uses bins to split data, improving efficiency.
2. LightGBM (Light Gradient Boosting Machine): Designed for faster training and lower memory usage with large datasets.
Features :
Histogram-Based Algorithms: Buckets continuous features into discrete bins, reducing memory usage and speeding up split finding on large datasets.
Speed and Efficiency: Grows trees leaf-wise (best-first) rather than level-wise, which typically trains faster for comparable accuracy.
3. CatBoost : Handles categorical features automatically and is robust to various data distributions.
Features :
Categorical Feature Handling: Directly processes categorical features.
Regularization: Uses advanced regularization techniques to reduce overfitting.
4. Lightning Boost : An extension of XGBoost, focusing on faster training and scalability.
Features :
Improved Efficiency: Better optimization for large datasets compared to XGBoost.
Accelerated Training: Supports both CPU and GPU acceleration.
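
Assuming the xgboost package is installed (pip install xgboost), a minimal usage sketch of its scikit-learn-style interface might look like the following; LightGBM and CatBoost expose very similar estimator classes. The parameter values are placeholders, not tuned recommendations.

    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    model = XGBClassifier(
        n_estimators=200,    # number of boosted trees
        learning_rate=0.1,   # shrinkage applied to each tree's contribution
        max_depth=4,         # depth limit per tree
        reg_lambda=1.0,      # L2 regularization on leaf weights
    )
    model.fit(X, y)
    print("Training accuracy:", model.score(X, y))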

3. Stacking ensemble : Stacking is an ensemble learning method in which models are combined in layers. Several base models, which can be of different types, are trained first; a meta-model is then trained on the predictions of these base models. The idea is that the meta-model learns which base models to trust in different situations. Stacking is used less often than the other ensemble methods.
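
For illustration, here is a minimal stacking sketch using scikit-learn's StackingClassifier, with a logistic regression meta-model on top of two base models of different types. The data and all settings are placeholders.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import StackingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Base models of different types
    base_models = [
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ]

    # The meta-model is trained on the base models' cross-validated predictions
    stack = StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(),
        cv=5,
    )
    stack.fit(X, y)
    print("Training accuracy:", stack.score(X, y))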


4. Voting ensemble : Voting is a straightforward ensemble learning technique that combines the predictions of several, usually different, models to make a final decision. With hard voting the majority class is chosen, while with soft voting the predicted class probabilities are averaged.
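
As an illustration, here is a minimal voting sketch using scikit-learn's VotingClassifier with three different base models; the models, data, and settings are placeholders.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import VotingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    voter = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
            ("nb", GaussianNB()),
        ],
        voting="soft",  # average predicted probabilities; use "hard" for a majority vote
    )
    voter.fit(X, y)
    print("Training accuracy:", voter.score(X, y))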
