All objective functions based on absolute values (squared error, absolute error, Huber, ...) carry an implicit parametric distribution assumption in disguise (Gaussian, Laplace, ...), and thus fail badly when that assumption is not satisfied.
To build a robust (outlier-resistant) regression tree, we should use an objective function that depends only on relative values (ranks).
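To see the hidden assumption, recall the standard identity that minimizing these losses is maximum-likelihood estimation under the corresponding noise model:

$$\arg\min_{\mu}\sum_i (y_i-\mu)^2 \;=\; \arg\max_{\mu}\prod_i \tfrac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(y_i-\mu)^2}{2\sigma^2}}, \qquad \arg\min_{\mu}\sum_i |y_i-\mu| \;=\; \arg\max_{\mu}\prod_i \tfrac{1}{2b}\,e^{-\frac{|y_i-\mu|}{b}}.$$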
Let's say we have a decision stump on the $d$-th dimension that splits the data points into reds ($x_i = 1$) and blues ($x_i = 0$).
To find the best split, first define an objective function that depends only on the relative values of the data points.
Kendall's tau, also known as the bubble-sort distance, is defined as follows:

$$\tau = \frac{\sum_{i<j} \operatorname{sgn}(x_i - x_j)\,\operatorname{sgn}(y_i - y_j)}{\binom{n}{2}}$$

When $i$ and $j$ have the same color, $\operatorname{sgn}(x_i - x_j)$ goes to zero, so it is equivalent to a sum over red–blue pairs only:

$$\tau = \frac{\sum_{i\,\text{red},\ j\,\text{blue}} \operatorname{sgn}(y_i - y_j)}{\binom{n}{2}}$$
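A minimal sketch of this quantity (the function name and the brute-force $O(n_r n_b)$ loop are mine; note that `scipy.stats.kendalltau` defaults to the tie-corrected tau-b, which normalizes differently when one variable is binary, so its values will not match exactly):

```python
import numpy as np

def kendall_tau_binary(color, y):
    """Kendall's tau between a binary color vector and targets y.

    Sums sgn(y_i - y_j) over red-blue pairs and normalizes by C(n, 2).
    The name and the brute-force O(n_r * n_b) loop are illustrative only.
    """
    n = len(y)
    reds, blues = y[color == 1], y[color == 0]
    s = np.sign(reds[:, None] - blues[None, :]).sum()
    return 2.0 * s / (n * (n - 1))

color = np.array([0, 0, 1, 1, 0])
y = np.array([1.0, 2.0, 5.0, 4.0, 3.0])
print(kendall_tau_binary(color, y))   # all 6 red-blue pairs concordant -> 6/10 = 0.6
```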
There is a trade-off between balance (how evenly the stump splits the points) and difference (how separated the two colors' targets are).
In practice, we can use other impurities (entropy, ...) and differences (log odds, ...) as long as the objective still depends only on the relative values (ranks) of the data points.
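As one sketch of such a swap (my own construction, not from the original): entropy as the balance/impurity term and the log odds of $P(y_{\text{red}} > y_{\text{blue}})$ as the difference term, both still rank-only:

```python
def entropy_logodds_score(color, y):
    """Illustrative variant: entropy as the balance term, log odds of
    P(y_red > y_blue) as the difference term. Still depends only on ranks."""
    n = len(y)
    reds, blues = y[color == 1], y[color == 0]
    if len(reds) == 0 or len(blues) == 0:
        return 0.0
    p = len(reds) / n
    balance = -(p * np.log(p) + (1 - p) * np.log(1 - p))   # split entropy
    gt = (reds[:, None] > blues[None, :]).mean()           # P(y_red > y_blue)
    lt = (reds[:, None] < blues[None, :]).mean()           # P(y_red < y_blue)
    eps = 1e-9                                             # avoid log(0)
    return balance * abs(np.log((gt + eps) / (lt + eps)))  # balance * |log odds|
```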
Note that:

$$\tau = \frac{n_r n_b}{\binom{n}{2}}\Big(P\big(y_i > y_j \mid i\ \text{red},\ j\ \text{blue}\big) - P\big(y_i < y_j \mid i\ \text{red},\ j\ \text{blue}\big)\Big)$$

(conditional probability: draw one blue and one red without replacement), where $\tau$ is the Kendall rank correlation coefficient shown before. The first factor rewards balance and the second rewards difference, which is exactly the trade-off above. There is also the classical bivariate-normal relation

$$\tau = \frac{2}{\pi}\arcsin(r),$$

where $r$ is the Pearson correlation coefficient.
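A quick numeric check of the balance-times-difference decomposition (reusing `kendall_tau_binary` and the sample data from the sketch above; the helper is likewise illustrative):

```python
def tau_decomposed(color, y):
    """Balance-times-difference form of tau; agrees with
    kendall_tau_binary on data without ties in y."""
    n = len(y)
    reds, blues = y[color == 1], y[color == 0]
    n_r, n_b = len(reds), len(blues)
    if n_r == 0 or n_b == 0:
        return 0.0
    difference = np.sign(reds[:, None] - blues[None, :]).mean()  # P(>) - P(<)
    balance = n_r * n_b / (n * (n - 1) / 2)                      # n_r*n_b / C(n,2)
    return balance * difference

print(tau_decomposed(color, y))   # same 0.6 as kendall_tau_binary above
```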
Calculate the following for every possible stump:

1. Initialize: set all points to blue, $\tau = 0$, $maximum = 0$, and sort the samples by $X_d$ into a queue.
2. Get the next item $i$ from the queue and turn it red.
3. Calculate $\tau$; if $|\tau| > maximum$, record the stump and set $maximum = |\tau|$.
4. Repeat step 2 until all points are red.
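A sketch of this sweep for a single dimension (names and structure are my own; each flip updates the pair sums incrementally in $O(n)$, so the whole sweep is $O(n^2)$, ignoring ties in $x$ for brevity):

```python
import numpy as np

def best_stump_1d(x, y):
    """Sweep thresholds along one feature, tracking tau incrementally.

    Returns (best_threshold, best_tau). Illustrative sketch only.
    """
    n = len(y)
    order = np.argsort(x)                 # queue: samples sorted by x
    total_pairs = n * (n - 1) / 2
    red = np.zeros(n, dtype=bool)         # all points start blue
    s = 0.0                               # sum of sgn(y_red - y_blue) over red-blue pairs
    best_tau, best_thr = 0.0, None
    for k, i in enumerate(order[:-1]):    # flipping the last point gives tau = 0
        s -= np.sign(y[red] - y[i]).sum()         # i no longer pairs as a blue
        red[i] = True
        s += np.sign(y[i] - y[~red]).sum()        # i now pairs as a red
        tau = s / total_pairs
        if abs(tau) > abs(best_tau):              # record stump if |tau| > maximum
            best_tau = tau
            best_thr = (x[i] + x[order[k + 1]]) / 2   # midpoint threshold
    return best_thr, best_tau
```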
In short: with some precomputation (sorting each feature and ranking the targets once), then for every depth of the tree, calculate a cumulative sum for each leaf.
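One concrete way to realize the cumulative-sum idea — a sketch under my own assumptions, not necessarily the author's exact scheme — is a Fenwick (binary indexed) tree over the within-leaf ranks of $y$, so each color flip updates the pair sums in $O(\log n)$ and the whole sweep costs $O(n \log n)$:

```python
import numpy as np

class Fenwick:
    """Binary indexed tree over positions 1..n for prefix counts."""
    def __init__(self, n):
        self.t = [0] * (n + 1)
    def add(self, i, v=1):
        while i < len(self.t):
            self.t[i] += v
            i += i & -i
    def prefix(self, i):                  # sum of counts at positions 1..i
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

def best_stump_1d_fast(x, y):
    """O(n log n) tau sweep, assuming no ties in y. Illustrative sketch."""
    n = len(y)
    rank = np.empty(n, dtype=int)
    rank[np.argsort(y)] = np.arange(1, n + 1)     # 1-based ranks of y
    blue, red = Fenwick(n), Fenwick(n)
    for r in rank:
        blue.add(int(r))                          # all points start blue
    total_pairs = n * (n - 1) / 2
    order = np.argsort(x)
    s, best_tau, best_thr = 0.0, 0.0, None
    for k, i in enumerate(order[:-1]):
        ri = int(rank[i])
        blue.add(ri, -1)                          # i stops being blue
        # contribution lost as a blue: sum over reds of sgn(y_red - y_i)
        reds_below = red.prefix(ri - 1)
        s -= (red.prefix(n) - reds_below) - reds_below
        red.add(ri)                               # i becomes red
        # contribution gained as a red: sum over blues of sgn(y_i - y_blue)
        blues_below = blue.prefix(ri - 1)
        s += blues_below - (blue.prefix(n) - blues_below)
        tau = s / total_pairs
        if abs(tau) > abs(best_tau):
            best_tau, best_thr = tau, (x[i] + x[order[k + 1]]) / 2
    return best_thr, best_tau
```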
Just like what the elastic net does, we can consider both absolute and relative value by taking a weighted sum of the two objectives.
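For instance, a hypothetical blended split score (the mixing parameter `alpha`, the variance-reduction term, and its normalization are all my illustrative choices), reusing `kendall_tau_binary` from above:

```python
def blended_score(color, y, alpha=0.5):
    """Weighted sum of an absolute-value criterion (variance reduction) and a
    relative-value criterion (|tau|), in the spirit of elastic net's mixing
    parameter. alpha=1 is pure rank-based, alpha=0 pure variance-based."""
    reds, blues = y[color == 1], y[color == 0]
    if len(reds) == 0 or len(blues) == 0:
        return 0.0
    n = len(y)
    var_reduction = y.var() - (len(reds) * reds.var() + len(blues) * blues.var()) / n
    var_reduction /= y.var() + 1e-12      # normalize to a comparable [0, 1] scale
    return alpha * abs(kendall_tau_binary(color, y)) + (1 - alpha) * var_reduction
```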