**Ensemble learning** combines multiple weak learners to build a strong learner.

# Simple techniques

## Majority voting

`Majority voting` is generally used in **classification** tasks: take the `majority` of the model predictions as the final prediction.
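
A minimal sketch of hard majority voting, assuming scikit-learn is available; the base models and synthetic dataset are illustrative choices, not prescribed by the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# voting="hard": each model casts one vote; the majority class wins.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```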

## Averaging

Take the `average of predictions` from all the models as the final prediction, in `regression` tasks or in `classification` tasks (by averaging the predicted probabilities).
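
A minimal averaging sketch for regression, again assuming scikit-learn; the particular base regressors are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, random_state=0)

models = [LinearRegression(), KNeighborsRegressor(), DecisionTreeRegressor(random_state=0)]
for m in models:
    m.fit(X, y)

# Final prediction = unweighted mean of the individual model predictions.
avg_pred = np.mean([m.predict(X[:5]) for m in models], axis=0)
print(avg_pred)
```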

## Weighted average

Assign `different weights` to the models, so that each weight reflects how much that model's prediction counts toward the final result.
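
As a sketch, the weighted average only swaps the unweighted mean for `np.average` with weights; the prediction values and weights below are made-up numbers that would normally come from validation performance:

```python
import numpy as np

# Hypothetical predictions from three models for the same three samples.
preds = np.array([
    [2.0, 3.1, 0.4],   # model A
    [1.8, 3.5, 0.2],   # model B
    [2.4, 2.9, 0.6],   # model C
])

# Illustrative weights expressing how much we trust each model (sum to 1).
weights = np.array([0.5, 0.3, 0.2])

weighted_pred = np.average(preds, axis=0, weights=weights)
print(weighted_pred)
```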

# Advanced ensemble techniques

## Stacking

Stacking (a.k.a. stacked generalization, or super learner) employs a number of first-level individual learners (Tier-1) trained on the original training set, followed by a second-level learner (Tier-2, a.k.a. *meta-learner*) that is trained on the first-level learners' predictions.
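
A minimal stacking sketch using scikit-learn's `StackingClassifier`; the choice of Tier-1 learners and the logistic-regression meta-learner are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[                            # first-level (Tier-1) learners
        ("rf", RandomForestClassifier(random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),   # second-level meta-learner (Tier-2)
    cv=5,                                   # out-of-fold predictions feed the meta-learner
)
stack.fit(X, y)
print(stack.predict(X[:5]))
```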

## Blending

Similar to stacking, but the base models make predictions only on a held-out **dev set** split from the training set (rather than on out-of-fold predictions over the whole training set). Those dev-set predictions, together with the dev-set labels, are used to train the meta-model, which then produces the final predictions on the test set.
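
A hand-rolled blending sketch, assuming scikit-learn and NumPy; the split ratios and base models are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Carve a dev (holdout) set out of the training set.
X_fit, X_dev, y_fit, y_dev = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

base_models = [RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)]
for m in base_models:
    m.fit(X_fit, y_fit)

# Base-model predictions on the dev set become features for the meta-model.
dev_features = np.column_stack([m.predict_proba(X_dev)[:, 1] for m in base_models])
test_features = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])

meta = LogisticRegression()
meta.fit(dev_features, y_dev)
print(meta.score(test_features, y_test))
```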

## Bagging

Bagging (**B**ootstrap **agg**regat**ing**):

- Create bootstrapped subsamples of the training data (sampling with replacement);
- Fit the base model on each of these subsets;
- Run the models **in parallel**, independently of each other;
- Combine all model predictions to determine the final prediction (see the sketch after this list).
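
A minimal bagging sketch with scikit-learn's `BaggingClassifier` over decision trees; the parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # base model fitted on each bootstrap sample
    n_estimators=50,           # number of bootstrapped subsets
    bootstrap=True,            # sample with replacement
    n_jobs=-1,                 # models are independent, so fit them in parallel
    random_state=0,
)
bag.fit(X, y)
print(bag.predict(X[:5]))
```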

### Random forest

**Pros**:

- Robust against outliers and noise;
- *Reduces variance* and typically avoids overfitting;
- Fast run time;

**Cons**:

- Can be slow to score as complexity increases;
- Lack of transparency due to the complexity of multiple trees;
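
A brief usage sketch, assuming scikit-learn; the hyperparameters shown are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Each tree sees a bootstrap sample and a random subset of features at each split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", n_jobs=-1, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```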

## Boosting

Boosting is a **sequential process**, where each subsequent model attempts to correct the errors of the previous model.

- Create a subset of the full dataset.
- Initially, all data points are given equal weights.
- Fit a base model on this subset.
- Use this base model to predict on the whole dataset.
- Calculate errors by comparing the predictions with the ground-truth labels.
- Wrongly predicted data points are given **higher weights**.
- Another model is created by repeating steps 3-6, in order to **correct the errors of the previous model**.
- Multiple models are created this way, each correcting the errors of its predecessor.
- The final model (strong learner) is the weighted mean of all the models (see the sketch after this list).
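
The loop below is a deliberately simplified, AdaBoost-style sketch of these steps (it reweights the full dataset instead of drawing subsets, and the stump depth, number of rounds, and dataset are illustrative), assuming scikit-learn and NumPy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y_signed = np.where(y == 1, 1, -1)          # labels in {-1, +1}

n = len(y)
weights = np.full(n, 1 / n)                 # all points start with equal weight
models, alphas = [], []

for _ in range(10):                         # sequential rounds
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y_signed, sample_weight=weights)      # fit base model on weighted data
    pred = stump.predict(X)
    err = np.clip(np.sum(weights[pred != y_signed]), 1e-10, 1 - 1e-10)  # weighted error
    alpha = 0.5 * np.log((1 - err) / err)   # this model's weight in the final vote
    weights *= np.exp(-alpha * y_signed * pred)         # boost weights of misclassified points
    weights /= weights.sum()
    models.append(stump)
    alphas.append(alpha)

# Strong learner: weighted vote over all weak learners.
final = np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
print((final == y_signed).mean())
```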

### AdaBoost

AdaBoost stands for **Adaptive Boosting**: misclassified samples receive higher weights at each round, so the next weak learner focuses on the hardest examples.
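
In practice one would normally use a library implementation; a minimal scikit-learn sketch with illustrative parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Default base learners are decision stumps (depth-1 trees).
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```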

**Pros**:

- Often among the best off-the-shelf models;
- Directly optimizes the **cost function**;

**Cons**:

- Not robust against outliers and noise;
- Can overfit;
- Requires choosing a proper stopping point (number of boosting rounds);

### Comparing bagging and boosting

Model error arises from noise, bias, and variance.

- *Noise* is the error inherent in the target function itself (irreducible error);
- *Bias* is the error from the algorithm failing to learn the target well;
- *Variance* is the error that comes from sampling the training data.

`Boosting` is recommended for models that suffer from **high bias**, not Bagging.

Conversely, `Bagging` is recommended for cases of **high variance**, rather than Boosting.

### GBM (Gradient Boosted Models)

### XGBoost

XGBoost (eXtreme Gradient Boosting) is a scalable tree boosting system [1].

### Light GBM

It is particularly useful for large datasets.

### CatBoost

# References

- 1. Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD.
- 2. Zhou, Z. (2016). Ensemble Learning.