\chapter{One Class Splitting Criteria for Random Forests}
\label{chap:ocrf}
We recall that this contribution is heuristic in nature and not yet supported by statistical results. This ongoing work has not been published yet and will certainly be completed in the near future, but we believe that it has its place in this manuscript, given the convincing empirical experiments we carried out and the rationale we gave for the promoted approach.
\begin{chapabstract}
This chapter presents the details relative to the introductory section~\ref{resume:ocrf}.
Random Forests (RFs) are strong machine learning tools for classification and regression. However, they remain supervised algorithms, and no extension of RFs to the one-class setting has been proposed, except for techniques based on second-class sampling.
This work fills this gap by proposing a natural methodology to extend standard splitting criteria to the one-class setting, structurally generalizing RFs to one-class classification.
%Our model is consistent with the two-class framework from which our approach can be recovered, considering the asymptotic behavior of an adaptive outliers generating process.
An extensive benchmark of seven state-of-the-art anomaly detection algorithms is also presented. This empirically demonstrates the relevance of our approach.
\end{chapabstract}
Note: The material of this chapter is based on submitted work \citep{OCRF16}.
\section{Introduction}
\label{ocrf:sec:intro}
\textit{Anomaly detection} generally aims at finding patterns/observations in data
that do not conform to the expected behavior.
% The analysis is usually performed
% by assuming
% that
Anomalies are usually assumed to lie in low probability regions of the data generating
process.
%
% by assuming that the dataset under study contains at most a
% \textit{small} number of anomalies, generated by distribution models that
% \textit{differ} from the one generating other data. %vast majority of the data.
% % anomalies are a \textit{small} number of observations generated by \textit{different} models from the one generating the rest of the data
% %\sout{ -- the only difference in novelty detection is that the novel patterns are incorporated into the normal model after being detected. }.
% % This formulation motivates many statistical anomaly detection methods, based
% % on the underlying assumption that anomalies occur in low probability regions of
% % the data generating process. Here and hereafter, the term `normal data' does
% % not refer to Gaussian distributed data, but to \emph{not abnormal} ones,
This assumption drives many statistical anomaly detection methods. % , assuming
% that anomalies are found in low probability regions of the data generating
% process.
Parametric techniques \citep{Barnett94, Eskin2000} suppose that
the normal data are generated by a distribution belonging to some specific
parametric model a priori known.
%
Here and hereafter, the term `normal data' does not refer to Gaussian
distributed data, but rather to \emph{not abnormal} data, \ie~data belonging to the
above-mentioned majority.
%
Classical non-parametric approaches are based on density (level set) estimation
\citep{Scholkopf2001, Scott2006, Breunig2000LOF, Quinn2014},
on dimensionality reduction \citep{Shyu2003, Aggarwal2001} or on decision trees
\citep{Liu2008, Shi2012}.
%
Relevant overviews of current research on anomaly detection can be found in
\cite{Hodge2004survey, Chandola2009survey, Patcha2007survey, Markou2003survey}.
The algorithm proposed in this chapter lies in the \emph{novelty detection}
setting, also called \emph{semi-supervised} anomaly detection or
\emph{one-class classification}. In this framework, we assume that we only observe examples of one
class (referred to as the normal class, or inlier class). The second (hidden) class is called the abnormal class, or outlier class.
% , and that objects from the other class
% (the abnormal class, or outlier class) are uniformly distributed.
The goal is to identify characteristics of the normal class, such as its
support or some density level sets with levels close to zero.
%
This setup is for instance used in some (non-parametric) kernel methods such as
One-Class Support Vector Machine algorithm (OCSVM) \citep{Scholkopf2001}, which extends
the SVM methodology \citep{Cortes1995,Shawe2004} to handle training using only normal observations (see Section~\ref{back:ocsvm}).
Recently, Least Squares Anomaly Detection (LSAD) \citep{Quinn2014} similarly extends a multi-class probabilistic classifier \citep{Sugiyama2010} to the one-class setting.
%
% it treats the origin as the only member of the second class (after mapping the data to some feature space). Thus the OCSVM finds a separating hyperplane between the origin and the mapped one class, using some `relaxation parameters'.
% The OCSVM consists in estimating Minimum Volume sets, which amounts to estimating density level sets, if the density has no flat parts.
Both the OCSVM and LSAD algorithms \emph{structurally} extend the corresponding classification
framework, namely without artificially creating a second class to
fall back on a two-class problem. %Estimating level sets of the underlying distribution boils down to assume the outlier distribution to be uniform. In other words, the one-class SVM extends the two-class one by assuming the existence of a uniform second class, without generating the latter. todo consequence de commenter ca?
The methodology proposed in this work applies the same structural effort to
Random Forests (RFs).
RFs \citep{Breiman2001} are estimators that fit a number of decision tree
classifiers on different random sub-samples of the dataset.
Each tree is built recursively, according to a splitting criterion based on
some impurity measure of a node.
The prediction is done by an average over each tree prediction.
In classification the averaging is based on a majority vote. % Empirical studies (see for instance~\cite{Breiman2001, Svetnik2003}) have shown that
RFs are strong machine learning tools, comparing well with state-of-the-art
methods such as SVM or boosting algorithms \citep{Freund1996}, and are used in
a wide range of domains \citep{Svetnik2003, Diaz2006, Genuer2010}. Practical and
theoretical insights on RFs are given in \cite{Genuer2008, Biau2008, Louppe2014, Biau2016}.
Yet few attempts have been made to transfer the idea of RFs to one-class
classification \citep{Desir12, Liu2008, Shi2012}.
%, Clemencon2014}. We do not cite Clemencon2014 because their TreeRank algorithm is not an RF (the base estimators are non-symmetric ranking trees); in fact, UnsupervisedTreeRank is just a trivial modification of TreeRank that takes volumes into account.
%
In \cite{Liu2008}, the novel concept of \emph{isolation} is introduced: the Isolation Forest algorithm isolates anomalies, instead of profiling the normal behavior which is the usual approach. It avoids adapting splitting rules to the one-class setting by using extremely randomized trees, also named extra trees \citep{Geurts2006}: isolation trees are built completely randomly, without any splitting rule. % chosing one feature and one uniform value over this feature to split on.
Therefore, Isolation Forest is not really based on RFs, the base estimators being extra trees instead of classical decision trees. However, Isolation Forest performs very well in practice with low memory and time complexities.
%SH Thus it is basically not a pure one-class algorithm aiming at estimating characteritics of one class, but rather a true outlier detector which applies ideally on training data polluted by anomalies.
% Isolation is done by randomly selecting a feature and then a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path
% length from the root node to the terminating node.
% Such a random partitioning produces noticeable shorter paths for anomalies. Hence, when a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.
In \cite{Desir12, Shi2012}, outliers are generated to artificially form a second
class.
%
In \cite{Desir12} the authors propose a technique to reduce the number of
outliers needed by shrinking the dimension of the input space. The outliers
are then generated from the reduced space using a distribution complementary to the `normal'
distribution. Thus their algorithm artificially generates a second class, to
use classical RFs.
%
In \cite{Shi2012}, two different outliers generating processes are compared.
In the first one, an artificial second class is added by randomly sampling
from the product of empirical marginal distributions. In the second one
outliers are uniformly generated from the hyper-rectangle that contains the
observed data. The first option is claimed to work best in practice, which can
be understood from the curse of dimensionality argument:
%
in high dimension \citep{Tax2002}, when the outlier distribution is not tightly concentrated around the target set,
the chance for an outlier to fall in the target set becomes very small, so that a huge number of outliers is needed.
%
%In other words, an unreasonable amount of artificial outliers has to be generated, for being dense enough to cover all possible `outlier behavior', \ie~dense enough to transform the two-class classification task into a (one-class) support estimation task.
Looking beyond the RF literature, \cite{Scott2006} propose a methodology to build dyadic decision trees to estimate minimum-volume sets \citep{Polonik97, Einmahl1992}. This is done by reformulating their structural risk minimization problem to be able to use \cite{Blanchard2004}'s algorithm.
While this methodology can also be used for pruning non-dyadic trees (assuming such a tree has been previously constructed, \eg~using some greedy heuristic), it does not allow one to effectively grow such trees. A dyadic structure has to be assumed in order 1) to link the tree growing process with the general optimization problem and 2) to control the complexity of the resulting partitions with respect to the regularity of the underlying distribution. In other words, this strong tree structure is needed to derive theoretical guarantees.
% This methodology can also be used for non-dyadic trees, by assuming such a tree has been
% previously constructed (\eg~using some greedy heuristic). Optimization then reduces to a classical pruning problem.
% While strong theoretical guaranties are derived, the pruning procedure does not allow to control the shape of the tree, the latter being determined by the initial building process.
In the same spirit, \cite{CLEM14} proposes to use the two-class splitting criterion defined in \cite{Clemencon2009Tree}. This two-class splitting rule aims at producing oriented decision trees with a `left-to-right' structure to address the bipartite ranking task.
Extension to the one-class setting is done by assuming a uniform distribution for the outlier class. This left-to-right structure is needed to reduce the tree building process to a recursive optimization procedure, thus making it possible to derive consistency and rate bounds.
% Besides the fact that the base estimators considered in this reference are different than the standard decision trees used in RFs, the main difference with our approach is that we do not assume any fixed outlier distribution, and we consider one-class versions of .
%
Thus, in these two references \citep{Scott2006, CLEM14}, priority is given to the theoretical analysis. This imposes constraints on the tree structure, which ends up being far from the general structure of the base estimators in RFs. The price to pay is the flexibility of the model and its ability to capture complex or broad patterns and structural characteristics of the data.
In this chapter, we choose to stick to the RF framework. We do not assume any structure for the binary decision trees. The price to pay is the lack of theoretical guarantees; the gain is that we keep the flexibility of RFs and are thus able to compete with state-of-the-art anomaly detection algorithms.
Besides, we do not assume any (fixed in advance) outlier distribution as in \cite{CLEM14}, but define it in an adaptive way during the tree building process.
To the best of our knowledge, no algorithm structurally extends (without second class sampling and without alternative base estimators) RFs to one-class classification. %, as OCSVM does for standard SVM.
Here we precisely % the main purpose of this work is to
introduce such a methodology. It builds on a natural adaptation of two-class % Gini improvement proxy, yielding a one-class Gini-based criterion specially designed for
splitting criteria to the one-class setting, as well as an adaptation of the two-class majority vote.
\textbf{Basic idea.} To split a node without second class (outliers) examples, the idea is as follows.
Each time we are looking for the best split for a node $t$, we simply replace (in the two-class \emph{impurity decrease} to be maximized \eqref{ocrf:tc_proxy}) the second class proportion going to the left child node $t_L$ by the proportion expectation $\leb(t_L)/\leb(t)$ (idem for the right node), where $\leb(t)$ denotes the volume of the rectangular cell corresponding to node $t$.
It ensures that one child node tries to capture the maximum number of observations with a minimal volume, while the other child looks for the opposite. %Such volume constraint/mass objective are similarly used to define minimum-volume sets \citep{Polonik97, Einmahl1992}.
This simple idea corresponds in fact to an adaptive modeling of the outlier distribution.
Since the proportion expectation mentioned above is weighted proportionally to the number of normal instances in node $t$, the resulting outlier distribution is tightly concentrated around the inliers. %as a result, the latter concentrates outside, closely around but also inside the support of the normal distribution.
%
Besides, and this attests to the consistency of our approach with the two-class framework, it turns out that the one-class model promoted here corresponds to the asymptotic behavior of an adaptive (\wrt~the tree growing process) outliers generating methodology.
This chapter is structured as follows. Section~\ref{ocrf:sec:background} provides the reader with the necessary background for Section~\ref{ocrf:sec:one-class}, which proposes a generic adaptation of RFs to the one-class setting and describes a generic one-class random forest algorithm. The latter is compared empirically with state-of-the-art anomaly detection methods in Section~\ref{ocrf:sec:benchmark}. Finally, a theoretical justification of the one-class criterion is given in Section~\ref{sec:ocrf:theory}.
%
\section{Background on decision trees}\label{ocrf:sec:background}
%% TODO DROUGUI: why RANDOM forests??
%\subsection{Binary trees}
%Let $\mb X = (X_1, \ldots, X_d)$ a random vector in
Let us denote by $\mathcal{X} \subset \mathbb{R}^d$ the $d$-dimensional hyper-rectangle containing all the observations.
Consider a binary tree on $\mathcal{X}$ whose node values are subsets of $\mathcal{X}$, iteratively produced by splitting $\mathcal{X}$ into two disjoint subsets.
Each internal node $t$ with value $\mathcal{X}_t$ is labeled with a split feature $m_t$ and split value $c_t$ (along that feature), in such a way that it divides $\mathcal{X}_t$ into two disjoint spaces $\mathcal{X}_{t_L} := \{x \in \mathcal{X}_t, x_{m_t} < c_t \}$ and $\mathcal{X}_{t_R} := \{x \in \mathcal{X}_t, x_{m_t} \ge c_t \}$, where $t_L$ (resp. $t_R$) denotes the left (resp. right) child of node $t$, and $x_j$ denotes the $j$th coordinate of vector $x$. Such a binary tree is grown from a sample $ X_1, \ldots, X_n$ (for all $i$, $ X_i \in \mathcal{X}$) and its finite depth is determined either by a fixed maximum depth value or by a stopping criterion evaluated on the nodes (\textit{e.g.} based on an impurity measure).
The external nodes (or the \emph{leaves}) form a partition of the input space $\mathcal{X}$.
In a supervised classification setting, these binary trees are called \emph{classification trees} and
prediction is made by assigning to each sample $x \in \mathcal{X}$ the majority class of the leaves containing $x$. This is called the \emph{majority vote}.
Classification trees are usually built using an impurity measure $i(t)$ whose decrease is maximized at each split of a node $t$, yielding an optimal split $(m_t^*, c_t^*)$. The decrease of impurity (also called \emph{goodness of split}) $\Delta i(t, t_L, t_R)$ \wrt~the split $(m_t, c_t)$ corresponding to the partition $\mathcal{X}_t=\mathcal{X}_{t_L}\sqcup \mathcal{X}_{t_R}$ of the node $t$ is defined as
\begin{align}
\label{ocrf:eq:impurity_measure_decrease}
\Delta i(t, t_L, t_R) = i(t) - p_L i(t_L) - p_R i(t_R),
\end{align}
where $p_L = p_L(t)$ (resp. $p_R = p_R(t)$) is the proportion of samples from $\mathcal{X}_t$ going to $\mathcal{X}_{t_L}$ (resp. to $\mathcal{X}_{t_R}$). The impurity measure $i(t)$ reflects the goodness of node $t$: the smaller $i(t)$, the purer the node $t$ and the better the prediction by majority vote on this node. Usual choices for $i(t)$ are the Gini index \citep{Gini1912} and the Shannon entropy \citep{Shannon2001}.
To produce a randomized tree, these optimization steps are usually partially randomized (conditionally on the data, splits $(m_t^*, c_t^*)$'s become random variables), and a classification tree can even be grown totally randomly \citep{Geurts2006}.
%
In a two-class classification setup, the Gini index is
\begin{align}
\label{ocrf:eq:gini}
i_G(t) = 2\left(\frac{n_t}{n_t + n_t'}\right) \left( \frac{n_t'}{n_t + n_t'}\right) = 1 - \frac{n_t^2 + n_t'^2}{(n_t + n_t')^2},
\end{align}
where $n_t$ (resp. $n_t'$) stands for the number of observations with label $0$ (resp. $1$) in node $t$. The Gini index is maximal when $n_t/(n_t + n_t') = n_t'/(n_t + n_t')=0.5$, namely when the node contains as many observations from each class: the node $t$ does not discriminate at all between the two classes.
For a node $t$, maximizing the impurity decrease~\eqref{ocrf:eq:impurity_measure_decrease}
is equivalent to minimizing $p_L i(t_L) + p_R i(t_R)$.
As $p_L = (n_{t_L} + n_{t_L}') / (n_t + n_t')$ and $p_R = (n_{t_R} + n_{t_R}')/(n_t + n_t')$, and the quantity $(n_t + n_t')$ being constant in the optimization problem, this
is equivalent to minimizing the following proxy of the impurity decrease:
\begin{align}
\label{ocrf:eq:two_class_proxy}
I(t_L, t_R) = (n_{t_L} + n_{t_L}') i(t_L) + (n_{t_R} + n_{t_R}') i(t_R).
\end{align}
Note that if we consider the Gini index as the impurity criterion, the corresponding proxy of the impurity decrease is
\begin{align}
\label{ocrf:tc_proxy}
I_G(t_L, t_R)= \frac{n_{t_L} n'_{t_L}}{n_{t_L} + n'_{t_L}} + \frac{n_{t_R} n'_{t_R}}{n_{t_R} + n'_{t_R}}.
\end{align}
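For concreteness, the proxy \eqref{ocrf:tc_proxy} can be computed as in the following minimal sketch (our own illustrative Python helper, not taken from any reference implementation; the argument names are ours):
\begin{verbatim}
# Two-class Gini proxy I_G(t_L, t_R): n_tL (resp. n_tR) inliers (label 0) and
# np_tL (resp. np_tR) second-class samples (label 1) fall in the left (resp. right) child.
def gini_proxy_two_class(n_tL, np_tL, n_tR, np_tR):
    left = n_tL * np_tL / (n_tL + np_tL) if n_tL + np_tL > 0 else 0.0
    right = n_tR * np_tR / (n_tR + np_tR) if n_tR + np_tR > 0 else 0.0
    return left + right  # the best split minimizes this quantity
\end{verbatim}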
%
In the one-class setting, no label is available, hence the impurity measure $i(t)$ % \eqref{ocrf:eq:impurity_measure_decrease}
does not apply to this setup. The standard splitting criterion, which consists in maximizing the decrease of the latter, cannot be used anymore.
% A naïve idea to solve this problem is to generate $n'$ samples uniformly on the hyper-rectangle that contains the data (the root node cell of a tree on the entire dataset) to form artificially a second class, and to consider the number $n'_t$ of these uniform observations that fall in node $\mathcal{X}_t$.
% Equivalently, we can also replace the second class instances number $n_t'$ by the expectation of the number of uniform observations falling in $\mathcal{X}_t$ among $n'$ when generating $n'$
% A natural way to extend the supervised classification framework to this unsupervised problem is to generate outliers in order to artificially form a second class. As no prior knowledge is available for determining the outliers distribution, the most natural approach is to estimate the level sets of the underlying distribution (observations are \iid~realizations of this distribution, potentially polluted by outliers). In other words and as mentioned above, the reference measure should be the Lebesgue measure, meaning that outliers should be generated according to an uniform distribution, over some region containing the target set, namely the support/level set of the normal distribution.
% However (see \cite{Tax2002}), in large dimension, when the outlier distribution is not very tightly defined around the target set, the chance for an outlier to be in the target set becomes very small, so that a huge number of outliers are needed (curse of dimensionality).
% %
% In other words, a unreasonable amount of artificial outliers have to be generated, for being dense enough to cover all possible `outlier behavior', \ie~dense enough to transform the two-class classification task into a (one-class) support estimation task.
% In other words, with such generated data, the two-class classifier should be able to discriminate the observed class from \emph{any} other possible distribution. However, the number of outliers needed to `surround' normal data increase drastically with the dimension (curse of dimensionality).
%In the next section, we propose a method to efficiently generate outliers, step by step, directly in dense regions of the trees.
%Then, considering the limit behavior of this generating process into the optimization problem at stake, yields a one-class proxy version of the impurity measure used to optimally splitting. Equipped with such one-class splitting rule, we avoid sampling outliers.
\section{Adaptation to the one-class setting}
\label{ocrf:sec:one-class}
%SH In this section, we present a theoretical ground for the one-class setting, yielding one-class versions of impurity measures. As outliers are supposed uniformly distributed, the abnormality measure corresponds to the Lebesgue measure. To circumvent the curse of dimensionality inherent in the consideration of this measure, an abnormality weight interpreted as the expected proportion of (non-observed) outliers is adaptively chosen in the tree building process \wrt~the volume of the node at stake. Consistence is achieved \wrt~the standard two-class framework from which our model can be recovered by simply replacing empirical quantities (relative to the non-observed abnormal behavior) by their expectation, when generating outliers in a specific efficient way during the tree building process. %The latter generating methodology is also a by-product of our approach.
% This yields a one-class random forest algorithm, called OneClassRF, which \emph{structurally} extends random forest to one-class classification.
% %, in the same spirit one-class SVM does with standard SVM.
% %
RFs do not apply to one-class classification for two reasons: neither the standard splitting criterion nor the majority vote carries over to this setup. In this section, we propose a one-class splitting criterion and a natural one-class version of the majority vote.
%\subsection{One-class splitting criterion}
\subsection{One-class splitting criterion}
\label{sec:one-class-crit}
As one does not observe the second class (outliers), $n_t'$ needs to be defined. In the naive approach below, it is defined as $n_t':= n' \leb(\mathcal{X}_t) / \leb(\mathcal{X})$, where $n'$ is the supposed total number of (hidden) outliers. In the adaptive approach hereafter, it will be defined as $n_t' := \gamma n_t$, with typically $\gamma=1$. Thus, the class ratio $\gamma_t := n_t'/n_t$ is defined in both approaches and goes to $0$ when $\leb(\mathcal{X}_t) \to 0$ in the naive approach, while it is maintained constant $\gamma_t \equiv \gamma$ in the adaptive one.
\paragraph{Naive approach.}
A naive approach to extend the Gini splitting criterion to the one-class setting is to assume a uniform distribution for the second class (outliers), and to replace their number $n_t'$ in node $t$ by the expectation $n' \leb(\mathcal{X}_t) / \leb(\mathcal{X})$, where $n'$ denotes the total number of outliers (for instance, it can be chosen as a proportion of the number of inliers).
Here and hereafter, $\leb$ denotes the Lebesgue measure on $\mathbb{R}^d$.
%
The problem with this approach appears when the dimension is \emph{not small}. As mentioned in the introduction (curse of dimensionality),
when actually generating $n'$ uniform outliers on $\mathcal{X}$, the probability that a node (sufficiently small to yield a good precision) contains at least one of them is very close to zero. That is why data-dependent distributions for the outlier class are often considered \citep{Desir12, Shi2012}.
%
Taking the expectation $n' \leb(\mathcal{X}_t) / \leb(\mathcal{X})$ instead of the number of points in node $t$ does not solve the curse of dimensionality mentioned in the introduction:
the volume proportion $L_t:=\leb(\mathcal{X}_t) / \leb(\mathcal{X})$ is very close to $0$ for nodes $t$ deep in the tree, especially in high dimension.
%
In addition, we typically grow trees on sub-samples of the input data, meaning that even the root node of the trees may be very small compared to the hyper-rectangle containing all the input data.
%
Another problem is that the Gini splitting criterion is skew-sensitive \citep{Flach2003}, and here it has to be applied to nodes $t$ with $0 \simeq n_t' \ll n_t$. When trying this approach empirically, we observe that splitting such nodes produces a child containing (almost) all the data (see Section~\ref{sec:ocrf:theory}).
%
\begin{remark}
To illustrate the fact that the volume proportion $L_t:=\leb(\mathcal{X}_t) / \leb(\mathcal{X})$ becomes very close to zero in high dimension for many nodes $t$ (in particular the leaves), suppose for the sake of simplicity that the input space is $\mathcal{X} = [0,1]^d$. Suppose that we are looking for a rough precision of $1/2^3=0.125$ in each dimension, \ie~a unit cube precision of $2^{-3d}$.
To achieve such a precision, the splitting criterion has to be used on nodes/cells $t$ of volume of order $2^{-3d}$, namely with $L_t = 1/2^{3d}$.
Note that if we decide to choose $n'$ to be $2^{3d}$ times larger than the number of inliers in order that $n' L_{t}$ is not negligible \wrt~the number of inliers, the same (reversed) problem of unbalanced classes appears on nodes with small depth.
\end{remark}
\paragraph{Adaptive approach.}
Our solution is to remove the uniform assumption on the outliers, and to choose their distribution adaptively in such a way that it is tightly concentrated around the inlier distribution. Formally, the idea is to maintain constant the class ratio $\gamma_t := n_t' / n_t$ on each node $t$: before looking for the best split, we update the number of outliers to be equal (up to a scaling constant $\gamma$) to the number of inliers, $n_t' = \gamma n_t$, \ie~$\gamma_t \equiv \gamma$. These (hidden) outliers are uniformly distributed on node $t$. The parameter $\gamma$ is typically set to $\gamma = 1$, see Remark~\ref{ocrf:rk:gamma}.
\textbf{Resulting density.} Figure~\ref{ocrf:fig:outlier_density} shows the corresponding outlier density $G$.
Note that $G$ is a piece-wise approximation of the inlier distribution $F$. Considering the Neyman-Pearson test $X \sim F$ vs. $X \sim G$ instead of $X \sim F$ vs. $X \sim \leb$ may seem unusual at first sight. However, note that there is $\epsilon>0$ such that $G>\epsilon$ on the entire input space, since the density $G$ is constant on each node and equal to the average of $F$ on this node \emph{before splitting it}. If the average of $F$ were estimated to be zero (no inlier in the node), the node would not have been split, whence the existence of $\epsilon$.
%
Thus, one might think of $G$ as
% Note that $G$ on some leaf $t_m$ with parents $t_0, \ldots, t_{m-1}$ takes the form $G_{|t_m} = \sum_{j=0}^{m} \int_{\mathcal{X}_{t_i}} F = 1/\leb{\mathcal{X}} + \sum_{j=1}^{m} \int_{\mathcal{X}_{t_i}} F $
a piece-wise approximation of $F_\epsilon := (1 - \epsilon) F + \epsilon \leb$. Yet, one can easily show that optimal tests for the Neyman-Pearson problem $H_0: X \sim F$ vs. $H_1: X \sim F_\epsilon$ are identical to the optimal tests for $H_0: X \sim F$ vs. $H_1: X \sim \leb$, since the corresponding likelihood ratios are related by a monotone transformation, see \cite{Scott2009} for instance (in fact, this reference shows that these two problems are even equivalent in terms of consistency and rates of convergence of the learning rules).
With this methodology, one cannot derive a one-class version of the Gini index \eqref{ocrf:eq:gini}, but we can define a one-class version of the proxy of the impurity decrease \eqref{ocrf:tc_proxy}, by simply replacing $n_{t_L}'$ (resp. $n_{t_R}'$) by $n_t' \lambda_L$ (resp. $n_t' \lambda_R$), where $\lambda_L := \leb(\mathcal{X}_{t_L}) / \leb(\mathcal{X}_{t})$ and $\lambda_R := \leb(\mathcal{X}_{t_R}) / \leb(\mathcal{X}_{t})$ are the volume proportions of the two child nodes:
\begin{align}
\label{ocrf:oc_proxy_ad2}
I_G^{OC-ad}(t_L, t_R)= \frac{n_{t_L} \gamma n_t \lambda_L}{n_{t_L} + \gamma n_t \lambda_L} + \frac{n_{t_R} \gamma n_t \lambda_R}{n_{t_R} + \gamma n_t \lambda_R}.
\end{align}
Minimization of the one-class Gini improvement proxy \eqref{ocrf:oc_proxy_ad2} is illustrated in Figure \ref{ocrf:fig:split_alpha}. %and \ref{ocrf:fig:split_two_modes}.
Note that $n_t'\lambda_L$ (resp. $n_t'\lambda_R$) is the expectation of the number of uniform observations (on $\mathcal{X}_t$) among $n_t'$ falling into the left (resp. right) node.
Choosing the split minimizing $I_G^{OC-ad}(t_L, t_R)$ at each step of the tree building process corresponds to generating $n_t' = \gamma n_t$ outliers each time the best split has to be chosen for node $t$, and then using the classical two-class Gini proxy \eqref{ocrf:tc_proxy}. The only difference is that $n_{t_L}'$ and $n_{t_R}'$ are replaced by their expectations $n_t'\lambda_{L}$ and $n_t'\lambda_{R}$ in our method.
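To make the criterion concrete, the following sketch (our own simplified Python helpers, not the implementation used in the experiments below) evaluates \eqref{ocrf:oc_proxy_ad2} and searches for the best threshold along a single feature of a rectangular node:
\begin{verbatim}
import numpy as np

def gini_proxy_one_class(n_tL, n_tR, lambda_L, lambda_R, n_t, gamma=1.0):
    """I_G^{OC-ad}(t_L, t_R), with n'_t := gamma * n_t hidden uniform outliers."""
    out_L = gamma * n_t * lambda_L   # expected number of outliers in t_L
    out_R = gamma * n_t * lambda_R   # expected number of outliers in t_R
    left = n_tL * out_L / (n_tL + out_L) if n_tL + out_L > 0 else 0.0
    right = n_tR * out_R / (n_tR + out_R) if n_tR + out_R > 0 else 0.0
    return left + right

def best_split_1d(x, cell_min, cell_max, gamma=1.0):
    """Best threshold for the sample coordinates x (one feature) of a node whose
    cell spans [cell_min, cell_max] along this feature."""
    n_t = len(x)
    best_score, best_threshold = np.inf, None
    for c in np.unique(x)[1:]:                   # keep both children non-empty
        n_tL = int(np.sum(x < c))
        lambda_L = (c - cell_min) / (cell_max - cell_min)  # volume proportion of t_L
        score = gini_proxy_one_class(n_tL, n_t - n_tL, lambda_L,
                                     1.0 - lambda_L, n_t, gamma)
        if score < best_score:
            best_score, best_threshold = score, c
    return best_threshold, best_score
\end{verbatim}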
\begin{remark}({\sc By-product: Efficiently generating outliers})
As a by-product, we obtain an efficient method to generate outliers tightly concentrated around the support of the normal distribution: it suffices to generate them as described above, recursively during the tree building process. Sample $n_t'$ uniform points on $\mathcal{X}_t$, use them to find the best split \wrt~\eqref{ocrf:tc_proxy}, and repeat on $\mathcal{X}_{t_L}$ and $\mathcal{X}_{t_R}$.
\end{remark}
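A purely illustrative sketch of this by-product (the helper and its arguments are ours; the node cell is assumed to be an axis-aligned rectangle described by its per-feature bounds):
\begin{verbatim}
import numpy as np

def sample_outliers_in_node(lower, upper, n_t, gamma=1.0, rng=None):
    """Draw n'_t = gamma * n_t points uniformly on the rectangular cell of node t."""
    rng = rng or np.random.default_rng()
    lower, upper = np.asarray(lower), np.asarray(upper)
    n_out = int(gamma * n_t)
    return rng.uniform(lower, upper, size=(n_out, len(lower)))
\end{verbatim}
These points can then be fed to the two-class proxy \eqref{ocrf:tc_proxy} to choose the split, and the sampling is repeated on each child cell.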
\begin{remark}({\sc Extension to other impurity criteria})
Our extension to the one-class setting also applies to other impurity criteria. For instance, in the case of the Shannon entropy defined in the two-class setup by
%\begin{align*}
%\label{ocrf:eq:shannon}
$i_S(t) = \frac{n_t}{n_t + n_t'} \log_2 \frac{n_t + n_t'}{n_t} + \frac{n_t'}{n_t + n_t'} \log_2 \frac{n_t + n_t'}{n_t'},$
%\end{align*}
the one-class impurity improvement proxy becomes
%\begin{align*}
%\label{ocrf:eq:one_class_shannon_proxy}
$I_S^{OC-ad}(t_L, t_R) = n_{t_L} \log_2 \frac{n_{t_L} + \gamma n_t \lambda_L}{n_{t_L}} + n_{t_R} \log_2 \frac{n_{t_R} + \gamma n_t \lambda_R}{n_{t_R}}.$
%\end{align*}
\end{remark}
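The latter formula translates directly into code (an illustrative transcription only, with the same conventions as the sketches above):
\begin{verbatim}
import numpy as np

def shannon_proxy_one_class(n_tL, n_tR, lambda_L, lambda_R, n_t, gamma=1.0):
    """I_S^{OC-ad}(t_L, t_R) as displayed above; an empty child contributes 0."""
    def term(n_child, lam):
        if n_child == 0:
            return 0.0
        return n_child * np.log2((n_child + gamma * n_t * lam) / n_child)
    return term(n_tL, lambda_L) + term(n_tR, lambda_R)
\end{verbatim}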
\begin{figure}[!ht]
\centering
\begin{tikzpicture}[scale=0.6,declare function={
%%cauchy distrib
c1 = 6/10;
c2 = 4/10;
seuil = c1/(c1+c2);
a1 = 0.1;
a2 = 0.2;
x01 = 0.8;
x02 = 2.9;
cauchyMass1(\x) = c1*a1/( pi*( pow(a1,2) + pow((\x-x01),2) ));
cauchyMass2(\x) = c2*a2/( pi*( pow(a2,2) + pow((\x-x02),2) ));
loinormale(\x) = c2*exp(-pow((\x-x02),2)/0.05);
loinormale2(\x) = c2*exp(-pow((\x-x02-1.5),2)/0.3);
indicatorFunction(\x) = exp(-pow(\x-3,2)/6);
%%params
firstVerticalSplitX = 1;
lastVerticalSplitX = 5.7;
verticalDashedSplitX = 2.3;
verticalSplitX = 4;
lowHorizontalDashedSplitY = 4.5;
highHorizontalDashedSplitY = 5.5;
coeffHomothety = 3.8;
homothetyBone = -1.7;
homothetyBtwo = -14;
},
]
\definecolor{niceblue}{rgb}{0.4,0.4,0.9}
\definecolor{blue2}{rgb}{0.9,1,1}
\definecolor{ggreen}{rgb}{0.3,0.7,0.4}
\definecolor{orange2}{rgb}{1,0.7,0}
%%%%%%%%%%%%%%%% FIRST curve
\fill [blue2, domain=-7.9:-1.65, variable=\x]
(-7.9, 1)
-- plot[samples=200,smooth] ({\x},{ 1.5*coeffHomothety*loinormale((\x-homothetyBone+15)/coeffHomothety) +1 } )
-- (-1.47, 1)
-- cycle;
\fill [blue2, domain=-5.8:-1.65, variable=\x]
(-5.85, 1)
-- plot[samples=200,smooth] ({\x},{ 1.5*0.633*coeffHomothety*cauchyMass2((\x-homothetyBone+15)/coeffHomothety) +1 } )
-- (-1.47, 1)
-- cycle;
\draw[->,>=latex] (-7.9,1) to (-7.9,3.2);
\draw[->,>=latex] (-7.9,1) to (-1.35,1);
\draw [domain=-7.9:-5.8, scale=1, color=niceblue, line width=1pt] plot[samples=200,smooth] (\x,{ 1.5*coeffHomothety*loinormale((\x-homothetyBone+15)/coeffHomothety) +1} );
\draw [domain=-5.85:-1.65, scale=1, color=niceblue, line width=1pt] plot[samples=200,smooth] (\x,{ 1.5*0.633*coeffHomothety*cauchyMass2((\x-homothetyBone+15)/coeffHomothety) +1} );
%% splits
\draw[color=orange,thick] (-7.9,1.5) -- (-1.65,1.5);
\node[color=orange] at (-7.2,1.8){G};
\node[color=niceblue] at (-4.8,3){F};
\node at (-3.2,3){\footnotesize naive};
\node at (-3.2,2.5){\footnotesize approach};
\draw (-4.2,2.2) -- (-2.2,2.2) -- (-2.2,3.3) -- (-4.2,3.3) -- (-4.2,2.2);
\node at (0.3,3){\footnotesize adaptive};
\node at (0.3,2.5){\footnotesize approach};
\draw (-0.7,2.2) -- (1.3,2.2) -- (1.3,3.3) -- (-0.7,3.3) -- (-0.7,2.2);
\node at (0.4,1.6) {\scalebox{2}{$\longrightarrow$}};
%%%%%%%%%%%%%%%% second curve
\fill [blue2, domain=1.6:7.95, variable=\x]
(1.6, 1)
-- plot[samples=200,smooth] ({\x},{ 1.5*coeffHomothety*loinormale((\x-homothetyBone+5.5)/coeffHomothety) +1 } )
-- (8.03, 1)
-- cycle;
\fill [blue2, domain=3.65:7.85, variable=\x]
(3.65, 1)
-- plot[samples=200,smooth] ({\x},{ 1.5*0.633*coeffHomothety*cauchyMass2((\x-homothetyBone+5.5)/coeffHomothety) +1 } )
-- (8.03, 1)
-- cycle;
\draw[->,>=latex] (1.6,1) to (1.6,3.2);
\draw[->,>=latex] (1.6,1) to (8.15,1);
\draw [domain=1.6:3.7, scale=1, color=niceblue, line width=1pt] plot[samples=200,smooth] (\x,{ 1.5*coeffHomothety*loinormale((\x-homothetyBone+5.5)/coeffHomothety) +1} );
\draw [domain=3.65:7.85, scale=1, color=niceblue, line width=1pt] plot[samples=200,smooth] (\x,{ 1.5*0.633*coeffHomothety*cauchyMass2((\x-homothetyBone+5.5)/coeffHomothety) +1} );
\node[color=niceblue] at (4.7,3){F};
\draw[color=orange,thick] (1.6,2) -- (5.5,2);
\draw[color=orange,thick] (5.5,1.17) -- (7.85,1.17);
\node[color=orange] at (2.2,2.4){G};
%% splits
\draw (5.5,0.8) -- (5.5,3.6);
%%%%%%%%%%%%%%%%%% third curve
\fill [blue2, domain=9.1:15.45, variable=\x]
(9.1, 1)
-- plot[samples=200,smooth] ({\x},{ 1.5*coeffHomothety*loinormale((\x-homothetyBone-2)/coeffHomothety) +1 } )
-- (15.53, 1)
-- cycle;
\fill [blue2, domain=11.15:15.35, variable=\x]
(11.15, 1)
-- plot[samples=200,smooth] ({\x},{ 1.5*0.633*coeffHomothety*cauchyMass2((\x-homothetyBone-2)/coeffHomothety) +1 } )
-- (15.53, 1)
-- cycle;
\draw[->,>=latex] (9.1,1) to (9.1,3.2);
\draw[->,>=latex] (9.1,1) to (15.65,1);
\draw [domain=9.1:11.2, scale=1, color=niceblue, line width=1pt] plot[samples=200,smooth] (\x,{ 1.5*coeffHomothety*loinormale((\x-homothetyBone-2)/coeffHomothety) +1} );
\draw [domain=11.15:15.35, scale=1, color=niceblue, line width=1pt] plot[samples=200,smooth] (\x,{ 1.5*0.633*coeffHomothety*cauchyMass2((\x-homothetyBone-2)/coeffHomothety) +1} );
\node[color=niceblue] at (12.2,3){F};
\node at (7.7,1.6) {\scalebox{2}{$\longrightarrow$}};
\draw[color=orange,thick] (9.1,1.1) -- (10,1.1);
\draw[color=orange,thick] (10,2.3) -- (13,2.3);
\draw[color=orange,thick] (13,1.28) -- (14,1.28);
\draw[color=orange,thick] (14,1.12) -- (15.35,1.12);
\node[color=orange] at (9.7,2.4){G};
%% splits
\draw (13,0.8) -- (13,3.6);
\draw (14,0.8) -- (14,3.6);
\draw (10,0.8) -- (10,3.6);
\end{tikzpicture}
\caption{\small Outlier distribution $G$ in the naive and adaptive approach. In the naive approach, $G$ does not depend on the tree and is constant on the input space. In the adaptive approach the distribution depends on the inlier distribution $F$ through the tree. The outlier density is constant on each node and equal to the average of $F$ on that node before splitting it.
}
\label{ocrf:fig:outlier_density}
\end{figure}
% \begin{figure}[!ht]
% \centering
% \includegraphics[width=0.8\linewidth]{fig_source/ocrf_fig/outlier_density.png}
% \caption{Outliers assumed distribution $G$ in the naive and adaptive approach. In the naive approach, $G$ does not depends on the tree and is constant on the input space. In the adaptive approach the distribution depends on the inlier distribution $F$ through the tree. The outliers density is constant and equal to the average of $F$ on each node before splitting it. Note that $G$ is tightly concentrated around the inliers, as required in high dimension.
% %In the case of more than one tree, the resulting outlier distribution is then the average of such tree-specific densities.
% }
% \label{ocrf:fig:outlier_density}
% \end{figure}
\pgfmathsetseed{7}
\begin{figure}
\center
\begin{tikzpicture}[scale=1,declare function={
%%cauchy distrib
c1 = 6/10;
c2 = 4/10;
seuil = c1/(c1+c2);
a1 = 0.1;
a2 = 0.2;
x01 = 0.8;
x02 = 2.9;
cauchyMass1(\x) = c1*a1/( pi*( pow(a1,2) + pow((\x-x01),2) ));
cauchyMass2(\x) = c2*a2/( pi*( pow(a2,2) + pow((\x-x02),2) ));
cauchyRepFuncInv1(\x) = a1*tan( 3.142*(\x-0.5) r) + x01;
cauchyRepFuncInv2(\x) = a2*tan( 3.142*(\x-0.5) r) + x02;
indicatorFunction(\x) = exp(-pow(\x-3,2)/6);
%%params
firstVerticalSplitX = 1;
lastVerticalSplitX = 5.7;
verticalDashedSplitX = 2.3;
verticalSplitX = 4;
lowHorizontalDashedSplitY = 4.5;
highHorizontalDashedSplitY = 5.5;
coeffHomothety = 3.8;
homothetyBone = -1.7;
homothetyBtwo = -14;
},
]
\definecolor{niceblue}{rgb}{0.4,0.4,0.9}
\definecolor{blue2}{rgb}{0.9,1,1}
%%% draw area
\clip (-0.03,0.5) rectangle (13.8,6.8);
%%% TOP RECTANGLES
\draw[thick] (0,3.4) rectangle (6.5,6.7);
\draw[thick] (7,3.4) rectangle (13.5,6.7);
\node at (0.25,3.8) {$\mathcal{X}$};
\node at (7.25,3.8) {$\mathcal{X}_t$};
\node[color=niceblue] at (verticalSplitX+0.3,lowHorizontalDashedSplitY-0.2) {$\mathcal{X}_t$};
%%%%%%%%%%%%%%%%%% LEFT PART
%%% points sampling
\foreach \x in {1,2,...,350}{
\pgfmathsetmacro{\seuil}{c1/(c1+c2)}
\pgfmathsetmacro{\aleatorio}{rnd}
\pgfmathsetmacro{\rndCauchy}{\aleatorio>seuil ? 0 : 1 }
\pgfmathsetmacro{\abscissePoint}{\rndCauchy*cauchyRepFuncInv1(rand) + (1-\rndCauchy)*cauchyRepFuncInv2(rand)}
\pgfmathsetmacro{\ordinatePoint}{\rndCauchy*(1.5*rand+5) + (1-\rndCauchy)*(rand*0.4+5)}
\pgfmathsetmacro{\abscissePointFiltered}{ \abscissePoint>6.6 ? -10 : \abscissePoint }
\pointSampled{\abscissePointFiltered,\ordinatePoint}
\pgfmathsetmacro{\rightEnough}{\abscissePoint>verticalDashedSplitX ? true : false }
\pgfmathsetmacro{\leftEnough}{\abscissePoint<verticalSplitX ? true : false }
\pgfmathsetmacro{\highEnough}{\ordinatePoint>lowHorizontalDashedSplitY ? true : false}
\pgfmathsetmacro{\lowEnough}{\ordinatePoint<highHorizontalDashedSplitY ? true : false}
\pgfmathsetmacro{\newAbscisse}{\rightEnough && \leftEnough && \highEnough && \lowEnough ? \abscissePoint*coeffHomothety + homothetyBone : -10 }
\pointSampled{\newAbscisse,\ordinatePoint*coeffHomothety + homothetyBtwo}
}
%% curve
\fill [blue2, domain=0.1:6.33, variable=\x]
(0.1, 1)
-- plot[samples=200,smooth] ({\x},{cauchyMass1(\x) + cauchyMass2(\x) +1} )
-- (6.33, 1)
-- cycle;
\draw [domain=0.1:6.33, scale=1, color=niceblue, line width=1pt, fill=blue2] plot[samples=200,smooth] (\x,{cauchyMass1(\x) + cauchyMass2(\x) +1});
%% axis
\draw[->,>=latex] (0.1,1) to (0.1,3.2);
\draw[->,>=latex] (0.1,1) to (6.5,1);
%% splits
\draw (firstVerticalSplitX,0.9) -- (firstVerticalSplitX,6.7); % gamma=10
\draw (verticalSplitX,0.9) -- (verticalSplitX,6.7); % gamma=1
%\draw (lastVerticalSplitX,0.9) -- (lastVerticalSplitX,6.7); % gamma=0.1
%\node[below] at (firstVerticalSplitX,1){\footnotesize $\gamma=10$};
%\node[below] at (verticalSplitX,1){\footnotesize $\gamma=1$};
%\node[below] at (lastVerticalSplitX+0.5,1){\footnotesize $\gamma=0.1$};
\draw (verticalDashedSplitX,0.9) -- (verticalDashedSplitX,6.7);
\draw (0,lowHorizontalDashedSplitY) -- (6.5,lowHorizontalDashedSplitY);
\draw (0,highHorizontalDashedSplitY) -- (6.5,highHorizontalDashedSplitY);
%% ZOOM
\draw[very thick, color=niceblue] (verticalDashedSplitX,lowHorizontalDashedSplitY) rectangle (verticalSplitX,highHorizontalDashedSplitY);
\draw[very thick, dashed, color=niceblue,->,>=latex] (verticalDashedSplitX,highHorizontalDashedSplitY) -- (7,6.7);
\draw[very thick, dashed, color=niceblue,->,>=latex] (verticalDashedSplitX,lowHorizontalDashedSplitY) -- (7,3.4);
%%%%%%%%%%%%%%%%%% RIGHT PART second curve
\fill [blue2, domain=7.1:13.53, variable=\x]
(7.1, 1)
-- plot[samples=200,smooth] ({\x},{ indicatorFunction((\x-7)/5)*coeffHomothety*1.5*cauchyMass2((\x-homothetyBone)/coeffHomothety) +1 } )
-- (13.53, 1)
-- cycle;
\draw[->,>=latex] (7.1,1) to (7.1,3.2);
\draw[->,>=latex] (7.1,1) to (13.7,1);
\draw [domain=7.1:13.53, scale=1, color=niceblue, line width=1pt] plot[samples=200,smooth] (\x,{ indicatorFunction((\x-7)/5)*1.5*coeffHomothety*cauchyMass2((\x-homothetyBone)/coeffHomothety) +1} );
%% splits
\draw (verticalSplitX+7.25,0.9) -- (verticalSplitX+7.25,6.7); % gamma=1
\draw (13.36,0.9) -- (13.36,6.7); % gammat
\node[below] (gammaone) at (verticalSplitX+7.25,1){\footnotesize $\gamma=1$};
\node[below] (gammat) at (13.36,1){\footnotesize $\gamma_t \simeq 0$};
\draw[dashed] (7.1,1.23) -- (verticalSplitX+7.25,1.23);
\node[right] at (6.65,1.35){\footnotesize $t_{\gamma}$};
\draw[->,>=latex, very thick] (13.3,1.75) to (verticalSplitX+7.3,1.75);
\node at (verticalSplitX+8.4,2) {\textbf{adaptivity}};
\end{tikzpicture}
\caption{ The left part of this figure represents
the dataset under study and the underlying density.
After some splits on this initial node $\mathcal{X}$,
let us consider the node $\mathcal{X}_t$ illustrated in the right part of this figure:
without the proposed adaptive approach, the class ratio
$\gamma_t$ becomes too small
and leads to poor splits %(normal data are condensed on a very small volume)
(all the data are on the `normal' side of the split, which thus does not discriminate at all).
Contrariwise, setting $\gamma$ to one, \textit{i.e.} using our adaptive approach,
is far preferable.
Note that a given $\gamma$ corresponds to a level set $t_{\gamma}$.}
\label{ocrf:fig:split_alpha}
\end{figure}
\subsection{Prediction: a majority vote with one single candidate?}
\label{ocrf:sec:prediction}
Now that RFs can be grown in the one-class setting using our one-class splitting criterion, the forest has to return a prediction adapted to this framework.
In other words we also need to extend the concept of majority vote.
%
Most usual one-class (or more generally anomaly detection) algorithms actually provide more than just a level-set estimate or a predicted label (abnormal vs. normal) for any new observation. Instead, they return a real-valued function, termed a \textit{scoring function}, defining a pre-order/ranking on the input space. Such a function $s: \mathbb{R}^d \to \mathbb{R}$ makes it possible to rank observations according to their supposed `degree of abnormality'. Thresholding it provides level-set estimates, as well as a decision rule that splits the input space into `normal' and `abnormal' regions.
%
%We thus adapt the majority vote to the one-class setting by defining the scoring function of a forest.
%
%We propose three natural scoring functions of a forest.
The scoring function $s(x)$ we use is the one defined in \cite{Liu2008}. It is a decreasing function of the average depth of the leaves containing $x$ in the forest, `if the trees were fully grown': an average term is added to each node containing more than one sample, say containing $N$ samples. This term $c(N)$ is the average depth of an extremely randomized tree \citep{Geurts2006} (\ie~built without minimizing any criterion, by randomly choosing one feature and one uniform value over this feature to split on) on $N$ samples. Formally,
\begin{align}
\label{ocrf:eq:scoring3}
\log_2 s(x) = -\left(\sum_{t \text{~leaves}} \mathds{1}_{\{ x \in t \}} \big(d_t + c(n_t)\big)\right) ~/~ c(n),
\end{align}
where $d_t$ is the depth of node $t$, and $c(n) = 2H(n - 1) - 2(n - 1)/n$, $H(i)$ being the harmonic number.
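To fix ideas, here is a minimal sketch of an Isolation-Forest-style scoring in the spirit of \cite{Liu2008} (our own simplification, averaging the augmented depths over the trees; the leaf-retrieval method \texttt{leaf\_of} is an assumption on the tree interface, not an existing API):
\begin{verbatim}
import numpy as np

def c(n):
    """Average depth of an extremely randomized tree built on n samples,
    c(n) = 2 H(n-1) - 2 (n-1)/n, with H(i) approximated by ln(i) + Euler's constant."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + np.euler_gamma) - 2.0 * (n - 1) / n

def score(x, forest, n):
    """s(x) = 2^{-average augmented depth / c(n)}; higher values flag anomalies
    (shorter average path), following the Isolation Forest convention.
    Each tree is assumed to expose leaf_of(x) -> (depth d_t, sample count n_t)."""
    depths = [d_t + c(n_t) for d_t, n_t in (tree.leaf_of(x) for tree in forest)]
    return 2.0 ** (-np.mean(depths) / c(n))
\end{verbatim}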
%
\begin{remark}[{\sc Alternative Scoring Functions}]
Although we use the scoring function defined in \eqref{ocrf:eq:scoring3} because of its established high performance \citep{Liu2008}, other scoring functions can be defined.
%\textbf{Score 1- Stepwise density estimate.}
A natural idea to adapt the majority vote to the one-class setting is to change the single vote of a leaf node $t$ into the fraction $\frac{n_t}{\leb(\mathcal{X}_t)}$, the forest output being the average of the latter quantity over the forest,
% \begin{align}
% \label{ocrf:eq:scoring1}
$s(x) = \sum_{t \text{~leaves}} \mathds{1}_{\{ x \in t \}} \frac{n_t}{\leb(\mathcal{X}_t)}.$
% \end{align}
In such a case, each tree of the forest yields a piece-wise density estimate, on its induced partition. % However, we are only interested in the order induced by this density estimate, and if the average of density estimates should be a good density estimate, this does not remains true for the average of order: in general, the average of scoring functions is not a good scoring function.
The output produced by the forest is then a \emph{step-wise density estimate}.
% , potentially good, but which is not (according to our experiments) as good as \eqref{ocrf:eq:scoring} as a scoring function. In other words, this density estimate is not accurate in estimating the support (or large level-sets) of the distribution.
% One possible explanation is that while averaging density estimate yields generally good density estimate, this is not necessary true when averaging orders/rankings. To see this, considering two scoring functions $s_1$ and $s_2$, $(s_1 + s_2)/2$ correspond to a different ranking than $(2s_1 + s_2)/2$, while $2s_1$ and $s_1$ correspond to the same ranking.
% %Besides, estimating a support using the support of a density average overestimate it, because it consists of the union of all the supports involved.
% In other terms, the average of scoring functions is not necessary a good scoring function.
% \end{remark}
%
%\textbf{Score 2- Local density of a typical cell.}
We could also think about the \emph{local density of a typical cell}.
For each point $x$ of the input space, it returns the average number of observations in the leaves containing $x$, divided by the average volume of such leaves.
The output of OneClassRF is then the scoring function
% \begin{align}
% \label{ocrf:eq:scoring2}
$s(x) = \big(\sum_{t \text{~leaves}} \mathds{1}_{\{ x \in t \}} n_t\big) \big(\sum_{t \text{~leaves}} \mathds{1}_{\{ x \in t \}} \leb(\mathcal{X}_t)\big)^{-1} ,$
% \end{align}
where the sums are over each leaf of each tree in the forest.
This score can be interpreted as the local density of a `typical' cell (typical among those usually containing $x$).
%
%Instead of $\frac{n_t}{\leb(\mathcal{X}_t)}$, we could have consider for a leaf node $t_L$ with parent node $t$, the maximum between $n_{t_L}$ and $n_t'\lambda_L$ or the fraction $\frac{n_{t_L}}{n_t'\lambda_L}$ which is a more natural adaptation of the majority vote. However, the latter quantity is equal to $\frac{n_{t_L}}{\leb(\Omega_{t_L})} / \frac{n_t'}{\leb(\Omega_t)}$, the rapport between the density of the leaf and the density of the parent node.
\end{remark}
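For completeness, the two alternative scores discussed in the remark above can be sketched as follows (illustrative only; \texttt{leaf\_stats} is assumed to collect, for each tree of the forest, the pair $(n_t, \leb(\mathcal{X}_t))$ of the leaf containing the query point):
\begin{verbatim}
import numpy as np

def stepwise_density_score(leaf_stats):
    """Score 1: average over the trees of the leaf density n_t / leb(X_t)."""
    return float(np.mean([n_t / vol for n_t, vol in leaf_stats]))

def typical_cell_density_score(leaf_stats):
    """Score 2: total leaf count divided by total leaf volume (`typical cell' density)."""
    counts, volumes = zip(*leaf_stats)
    return sum(counts) / sum(volumes)
\end{verbatim}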
\subsection{OneClassRF: a Generic One-Class Random Forest algorithm}
Let us summarize our One-Class Random Forest algorithm, based on generic RFs \citep{Breiman2001}. It has six parameters:
$max\_samples$, $max\_features\_tree$, $max\_features\_node$, $\gamma$, $max\_depth$, $n\_trees$.
Each tree is classically grown on a random subset of both the input samples and the input features \citep{Ho1998, Panov2007}.
This random subset is a sub-sample of size $max\_samples$, with $max\_features\_tree$ variables chosen at random without replacement (replacement is only done after the tree is grown). The tree is built by minimizing \eqref{ocrf:oc_proxy_ad2} for each split, using parameter $\gamma$ (recall that $n_t' := \gamma n_t$), until the maximal depth $max\_depth$ is achieved.
% define a large number of geometric features and search over a random selection
% of these for the best split at each node
Minimizing \eqref{ocrf:oc_proxy_ad2} is done as introduced in \cite{Amit1997}, defining a large number $max\_features\_node$ of geometric features and searching over a random selection of these for the best split at each node.
%
The forest is composed of a number $n\_trees$ of trees. The predicted score of a point $x$ is given by $s(x)$, the $s$'s being defined in Section~\ref{ocrf:sec:prediction}.
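As an illustration of how these six parameters interact, here is a heavily simplified, self-contained Python sketch of a OneClassRF-style training loop (our own code and naming, not the implementation used in the experiments; in particular, the per-node search is a plain random selection of $max\_features\_node$ input features rather than the geometric features of \cite{Amit1997}):
\begin{verbatim}
import numpy as np

def one_class_gini_proxy(n_L, n_R, lam_L, n_t, gamma):
    out_L, out_R = gamma * n_t * lam_L, gamma * n_t * (1.0 - lam_L)
    left = n_L * out_L / (n_L + out_L) if n_L + out_L > 0 else 0.0
    right = n_R * out_R / (n_R + out_R) if n_R + out_R > 0 else 0.0
    return left + right

def grow_tree(X, lower, upper, depth, max_depth, max_features_node, gamma, rng):
    n_t = len(X)
    if depth >= max_depth or n_t <= 1:
        return {"leaf": True, "n": n_t, "volume": float(np.prod(upper - lower))}
    d, best = X.shape[1], None
    for m in rng.choice(d, size=min(max_features_node, d), replace=False):
        for c in np.unique(X[:, m])[1:]:          # candidate thresholds
            n_L = int(np.sum(X[:, m] < c))
            lam_L = (c - lower[m]) / (upper[m] - lower[m])
            s = one_class_gini_proxy(n_L, n_t - n_L, lam_L, n_t, gamma)
            if best is None or s < best[0]:
                best = (s, m, c)
    if best is None:                              # all selected features constant
        return {"leaf": True, "n": n_t, "volume": float(np.prod(upper - lower))}
    _, m, c = best
    up_L, lo_R = upper.copy(), lower.copy()
    up_L[m], lo_R[m] = c, c
    go_left = X[:, m] < c
    return {"leaf": False, "feature": int(m), "threshold": float(c),
            "left": grow_tree(X[go_left], lower, up_L, depth + 1, max_depth,
                              max_features_node, gamma, rng),
            "right": grow_tree(X[~go_left], lo_R, upper, depth + 1, max_depth,
                               max_features_node, gamma, rng)}

def fit_one_class_rf(X, n_trees=100, max_samples=256, max_features_tree=None,
                     max_features_node=5, gamma=1.0, max_depth=8, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    max_features_tree = max_features_tree or d
    forest = []
    for _ in range(n_trees):
        rows = rng.choice(n, size=min(max_samples, n), replace=False)
        cols = rng.choice(d, size=min(max_features_tree, d), replace=False)
        Xs = X[np.ix_(rows, cols)]
        lower, upper = Xs.min(axis=0), Xs.max(axis=0)
        forest.append(grow_tree(Xs, lower, upper, 0, max_depth,
                                max_features_node, gamma, rng))
    return forest
\end{verbatim}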
%SH (copier en suppl mat) As an illustration, Figure~\ref{ocrf:fig:oneclassrf} represents the level set of the scoring function produced by OneClassRF, with only one tree ($n\_trees$$=1$) of maximal depth $max\_depth$=4, without sub-sampling, and using the Gini-based one-class splitting criterion with $\gamma=1$.
% Remarks on the underlying level-set estimation, alternative stopping criteria and variable importances are available in supplementary material.
Figure~\ref{ocrf:fig:oneclassrf} represents the level sets of the scoring function produced by OneClassRF, with only one tree ($n\_trees = 1$) of maximal depth $max\_depth = 4$, without sub-sampling, and using the Gini-based one-class splitting criterion with $\gamma=1$.
\begin{figure}[ht]
\centering
\includegraphics[width=0.5\linewidth]{fig_source/ocrf_fig/oneclassrf.png}
\caption{OneClassRF with one tree: level-sets of the scoring function}
\label{ocrf:fig:oneclassrf}
\end{figure}
% \begin{remark}({\sc Underlying Level-Set estimation})
% As mentionned above, the link between $\gamma$ and the level of the level-set estimate induced by the split is difficult to exhibit, as well as how it repercutes into the level of the global level-set estimate.
% Intuitively, $\gamma$ plays the same role as parameter $\nu$ for the OCSVM in its $\nu$-soft margin formulation.
% OCSVM also returns a scoring function (inducing an infinite number of level-sets), but which seeks to be optimal for estimating a particular level-set (corresponding to a particular thresholding of the scoring function) whose level can explicitely be controlled by $\nu$.
% However, we are not able to derive an explicit relation, as it exists for the OCSVM, between the targeted level set and $\gamma$. Note that this does not mean that we are not able to estimate any arbitrary level-set. It only means that we do not know for which level, the scoring function outputed by OneClassRF originally seeks to be optimal.
% \end{remark}
\begin{remark}({\sc Interpretation of $\gamma$})
\label{ocrf:rk:gamma}
In order for the splitting criterion \eqref{ocrf:oc_proxy_ad2} to perform well, $n_t'$ is expected to be of the same order of magnitude as the number of normal observations $n_t$. If $\gamma = n_t'/n_t \ll 1$, the split puts all the normal data on the same side, even those which are far in the tail of the distribution, thus widely over-estimating the support of normal data. If $\gamma \gg 1$, the opposite effect happens, yielding an estimate of a $t$-level set with $t$ not close enough to $0$.
Figure~\ref{ocrf:fig:split_alpha}
illustrates the splitting criterion when $\gamma$ varies. It clearly shows that there is a link between the parameter $\gamma$ and the level $t_\gamma$ of the induced level-set estimate. However, an explicit relation between $\gamma$ and $t_\gamma$ is hard to derive theoretically. By default we set $\gamma$ to $1$.
%
% The outlier sampling size $\gamma = n_t' / n_t$ is a parameter of the forest which controls at each split the level of the (local) estimated level-set (induced by the split), as illustrated in Figure~\ref{ocrf:fig:split_alpha}. %The more there is outliers, the higher the corresponding level. % the estimated level set/density support fit to the training data.
% The OneClassRF algorithm % takes as parameter the fraction $\alpha = n_t'/n_t$,
% fixed by default $\gamma=1$. %this parameter to $n_t'=n_t$.
One could object that in some situations, it is useful to randomize this parameter. For instance, in the case of a bi-modal distribution for the normal behavior, one split of the tree needs to separate two clusters, in order for the level set estimate to distinguish between the two modes. As illustrated in Figure~\ref{ocrf:fig:split_alpha_2}, this can only occur if $n_t'$ is large with respect to $n_t$ ($\gamma \gg 1$). However, the randomization of $\gamma$ is somehow included in the randomization of each tree, thanks to the sub-sampling inherent to RFs.
%
Moreover, small clusters tend to vanish when the sub-sample size is sufficiently small: a small sub-sampling size is used in \cite{Liu2008} to isolate outliers even when they form clusters.
\end{remark}
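The following minimal numerical sketch (not the implementation evaluated in Section~\ref{ocrf:sec:benchmark}) may help visualize this trade-off. It assumes, for illustration only, a Gini-type proxy in which the $\gamma n_t$ hypothetical outliers are spread uniformly over the current cell, and prints where the impurity-minimizing split falls on a bi-modal one-dimensional sample for several values of $\gamma$; the exact proxy used by OneClassRF is the one defined in \eqref{ocrf:oc_proxy_ad2}.
\begin{pythoncode}
# Hedged sketch: a Gini-type one-class proxy in which gamma * n_t
# fictitious outliers are assumed uniformly distributed on the cell.
import numpy as np

def one_class_gini_proxy(x, threshold, gamma, lo, hi):
    """Weighted Gini impurity of the split {x <= threshold} / {x > threshold},
    with gamma * len(x) fictitious uniform outliers on [lo, hi]."""
    n_t = len(x)
    n_out = gamma * n_t                                 # n_t' = gamma * n_t
    n_left = np.sum(x <= threshold)
    n_right = n_t - n_left
    out_left = n_out * (threshold - lo) / (hi - lo)     # expected outliers on the left
    out_right = n_out - out_left

    def gini(n_normal, n_abnormal):
        n = n_normal + n_abnormal
        return 0.0 if n == 0 else 2.0 * n_normal * n_abnormal / n ** 2

    return ((n_left + out_left) * gini(n_left, out_left)
            + (n_right + out_right) * gini(n_right, out_right)) / (n_t + n_out)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.8, 0.1, 600), rng.normal(2.9, 0.2, 400)])
thresholds = np.linspace(x.min() + 0.05, x.max() - 0.05, 200)
for gamma in (0.1, 1.0, 10.0):
    scores = [one_class_gini_proxy(x, s, gamma, x.min(), x.max()) for s in thresholds]
    print(gamma, thresholds[int(np.argmin(scores))])    # impurity-minimizing split for this gamma
\end{pythoncode}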
\pgfmathsetseed{7}
\begin{figure}
\center
\begin{tikzpicture}[scale=1,declare function={
%%cauchy distrib
c1 = 6/10;
c2 = 4/10;
seuil = c1/(c1+c2);
a1 = 0.1;
a2 = 0.2;
x01 = 0.8;
x02 = 2.9;
cauchyMass1(\x) = c1*a1/( pi*( pow(a1,2) + pow((\x-x01),2) ));
cauchyMass2(\x) = c2*a2/( pi*( pow(a2,2) + pow((\x-x02),2) ));
cauchyRepFuncInv1(\x) = a1*tan( 3.142*(\x-0.5) r) + x01;
cauchyRepFuncInv2(\x) = a2*tan( 3.142*(\x-0.5) r) + x02;
indicatorFunction(\x) = exp(-pow(\x-3,2)/6);
%%params
firstVerticalSplitX = 1;
lastVerticalSplitX = 5.7;
verticalDashedSplitX = 2.3;
verticalSplitX = 4;
lowHorizontalDashedSplitY = 4.5;
highHorizontalDashedSplitY = 5.5;
coeffHomothety = 3.8;
homothetyBone = -1.7;
homothetyBtwo = -14;
},
]
\definecolor{niceblue}{rgb}{0.4,0.4,0.9}
\definecolor{blue2}{rgb}{0.9,1,1}
%%% draw area
\clip (-0.03,0.5) rectangle (7,6.8);
%%% TOP RECTANGLES
\draw[thick] (0,3.4) rectangle (6.5,6.7);
%\draw[thick] (7,3.4) rectangle (13.5,6.7);
\node at (0.25,3.8) {$\mathcal{X}$};
%\node at (7.25,3.8) {$\mathcal{X}_t$};
%\node[color=niceblue] at (verticalSplitX+0.3,lowHorizontalDashedSplitY-0.2) {$\mathcal{X}_t$};
%%%%%%%%%%%%%%%%%% LEFT PART
%%% points sampling
\foreach \x in {1,2,...,350}{
\pgfmathsetmacro{\seuil}{c1/(c1+c2)}
\pgfmathsetmacro{\aleatorio}{rnd}
\pgfmathsetmacro{\rndCauchy}{\aleatorio>seuil ? 0 : 1 }
\pgfmathsetmacro{\abscissePoint}{\rndCauchy*cauchyRepFuncInv1(rand) + (1-\rndCauchy)*cauchyRepFuncInv2(rand)}
\pgfmathsetmacro{\ordinatePoint}{\rndCauchy*(1.5*rand+5) + (1-\rndCauchy)*(rand*0.4+5)}
\pgfmathsetmacro{\abscissePointFiltered}{ \abscissePoint>6.6 ? -10 : \abscissePoint }
\pointSampled{\abscissePointFiltered,\ordinatePoint}
%\pgfmathsetmacro{\rightEnough}{\abscissePoint>verticalDashedSplitX ? true : false }
%\pgfmathsetmacro{\leftEnough}{\abscissePoint<verticalSplitX ? true : false }
%\pgfmathsetmacro{\highEnough}{\ordinatePoint>lowHorizontalDashedSplitY ? true : false}
%\pgfmathsetmacro{\lowEnough}{\ordinatePoint<highHorizontalDashedSplitY ? true : false}
%\pgfmathsetmacro{\newAbscisse}{\rightEnough && \leftEnough && \highEnough && \lowEnough ? \abscissePoint*coeffHomothety + homothetyBone : -10 }
%\pointSampled{\newAbscisse,\ordinatePoint*coeffHomothety + homothetyBtwo}
}
%% curve
\fill [blue2, domain=0.1:6.33, variable=\x]
(0.1, 1)
-- plot[samples=200,smooth] ({\x},{cauchyMass1(\x) + cauchyMass2(\x) +1} )
-- (6.33, 1)
-- cycle;
\draw [domain=0.1:6.33, scale=1, color=niceblue, line width=1pt, fill=blue2] plot[samples=200,smooth] (\x,{cauchyMass1(\x) + cauchyMass2(\x) +1});
%% axis
\draw[->,>=latex] (0.1,1) to (0.1,3.2);
\draw[->,>=latex] (0.1,1) to (6.5,1);
%% splits
\draw (firstVerticalSplitX,0.9) -- (firstVerticalSplitX,6.7); % gamma=10
\draw (verticalSplitX,0.9) -- (verticalSplitX,6.7); % gamma=1
\draw (lastVerticalSplitX,0.9) -- (lastVerticalSplitX,6.7); % gamma=0.1
\node[below] at (firstVerticalSplitX,1){\footnotesize $\gamma=10$};
\node[below] at (verticalSplitX,1){\footnotesize $\gamma=1$};
\node[below] at (lastVerticalSplitX+0.5,1){\footnotesize $\gamma=0.1$};
%\draw (verticalDashedSplitX,0.9) -- (verticalDashedSplitX,6.7);
%\draw (0,lowHorizontalDashedSplitY) -- (6.5,lowHorizontalDashedSplitY);
%\draw (0,highHorizontalDashedSplitY) -- (6.5,highHorizontalDashedSplitY);
%% ZOOM
%\draw[very thick, color=niceblue] (verticalDashedSplitX,lowHorizontalDashedSplitY) rectangle (verticalSplitX,highHorizontalDashedSplitY);
%\draw[very thick, dashed, color=niceblue,->,>=latex] (verticalDashedSplitX,highHorizontalDashedSplitY) -- (7,6.7);
%\draw[very thick, dashed, color=niceblue,->,>=latex] (verticalDashedSplitX,lowHorizontalDashedSplitY) -- (7,3.4);
%%%%%%%%%%%%%%%%%% RIGHT PART second curve
\vide{
\fill [blue2, domain=7.1:13.53, variable=\x]
(7.1, 1)
-- plot[samples=200,smooth] ({\x},{ indicatorFunction((\x-7)/5)*coeffHomothety*1.5*cauchyMass2((\x-homothetyBone)/coeffHomothety) +1 } )
-- (13.53, 1)
-- cycle;
\draw[->,>=latex] (7.1,1) to (7.1,3.2);
\draw[->,>=latex] (7.1,1) to (13.7,1);
\draw [domain=7.1:13.53, scale=1, color=niceblue, line width=1pt] plot[samples=200,smooth] (\x,{ indicatorFunction((\x-7)/5)*1.5*coeffHomothety*cauchyMass2((\x-homothetyBone)/coeffHomothety) +1} );
%% splits
\draw (verticalSplitX+7.25,0.9) -- (verticalSplitX+7.25,6.7); % gamma=1
\draw (13.36,0.9) -- (13.36,6.7); % gammat
\node[below] (gammaone) at (verticalSplitX+7.25,1){\footnotesize $\gamma=1$};
\node[below] (gammat) at (13.36,1){\footnotesize $\gamma_t$};
\draw[dashed] (7.1,1.23) -- (verticalSplitX+7.25,1.23);
\node[right] at (6.65,1.35){\footnotesize $t_{\gamma}$};
\draw[->,>=latex, very thick, color=niceblue] (13.3,1.3) to[bend right] (verticalSplitX+7.3,1.3);
\node at (verticalSplitX+8.2,2) {\color{niceblue} \textbf{adaptivity}};
}
\end{tikzpicture}
\caption{Illustration of the standard splitting
criterion on two modes when the proportion $\gamma$ varies.}
\label{ocrf:fig:split_alpha_2}
\end{figure}
\begin{remark}({\sc Alternative Stopping Criteria})
Other stopping criteria than a maximal depth may be considered. We could stop splitting a node $t$ when it contains fewer than $n\_min$ observations, or when the quantity $n_t/\leb(\mathcal{X}_t)$ is large enough (all the points in the cell $\mathcal{X}_t$ are likely to be normal) or close enough to $0$ (all the points in the cell $\mathcal{X}_t$ are likely to be abnormal). These options are not investigated further in this work.
\end{remark}
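As a purely illustrative sketch (the benchmarks of Section~\ref{ocrf:sec:benchmark} rely on the maximal depth criterion alone), such rules could take the following form; the thresholds \verb+n_min+, \verb+rho_min+ and \verb+rho_max+ are hypothetical.
\begin{pythoncode}
# Illustrative sketch of the alternative stopping rules mentioned above;
# n_min, rho_min and rho_max are hypothetical thresholds.
def should_stop_splitting(n_t, volume_t, depth, max_depth,
                          n_min=5, rho_min=1e-3, rho_max=1e3):
    """Return True if node t should become a leaf."""
    if depth >= max_depth:             # criterion used in this work
        return True
    if n_t < n_min:                    # too few observations in the cell
        return True
    density_proxy = n_t / volume_t     # n_t / Leb(X_t)
    if density_proxy >= rho_max:       # cell very likely to contain only normal data
        return True
    if density_proxy <= rho_min:       # cell very likely to contain only abnormal data
        return True
    return False
\end{pythoncode}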
\begin{remark}({\sc Variable importance})
In the multi-class setting, \cite{Breiman2001} proposed to evaluate the importance of a feature $j \in \{1,\ldots, d\}$ for prediction by
adding up the weighted impurity decreases % $p(t) \Delta I(s_t, t)$
for all nodes $t$ where $X_j$ is used, averaged over all the trees. The analogous quantity can be computed with respect to the one-class impurity decrease proxy. % $I_{oc}(s_t, t)$.
In our one-class setting, this quantity represents the size of the tail of $X_j$, and can be interpreted as the capacity of feature $j$ to discriminate between normal and abnormal data.%, or the importance of feature $j$ to isolate anomalies
\end{remark}
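A minimal sketch of this computation is given below; the node attributes (\verb+feature+, \verb+weight+, \verb+oc_impurity_decrease+, \verb+children+) are hypothetical names used for illustration only, not an actual API.
\begin{pythoncode}
# Hedged sketch of the one-class variable importance described above;
# the node attributes are hypothetical names, not an actual implementation.
from collections import defaultdict

def one_class_variable_importance(forest, n_features):
    """Sum of the weighted one-class impurity decreases per feature,
    averaged over all the trees of the forest."""
    importances = defaultdict(float)
    for tree in forest:
        stack = [tree.root]
        while stack:
            node = stack.pop()
            if node.children:          # internal nodes only
                importances[node.feature] += node.weight * node.oc_impurity_decrease
                stack.extend(node.children)
    return [importances[j] / len(forest) for j in range(n_features)]
\end{pythoncode}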
%by \eqref{ocrf:eq:scoring1}, \eqref{ocrf:eq:scoring2} or \eqref{ocrf:eq:scoring3}.
% \paragraph{Description of the OneClassRF algorithm.}
% Now that we have exposed the one-class splitting rule to grow the trees, as well as the forest output replacing the majority vote, the one-class random forest algorithm promoted here is described as follows. %similar to Breiman's approach, see \cite{Breiman2001}.
% % Let us denote by \ttt{X} the input data.
% In its most generic version, it has five parameters, $max\_samples$, $max\_features$, $n\_estimators$, $max\_depth$ and $\gamma$.
% Each tree is classicaly grown on a random subset of both the input samples and the input variables, see~\cite{Ho1998, Panov2007}. This random subset is a sub-sample of size $max\_samples$, with $max\_features$ variables chosen at random without replacement (replacement is only done after the tree is grown). The tree is built by minimizing \eqref{ocrf:oc_proxy_ad2} at each step using parameter $\gamma$, until the maximal depth $max\_depth$ is achieved. The forest is composed of a number $n\_estimators$ of trees. The predicted score of a point $x$ is given by \eqref{ocrf:eq:scoring}, namely the average (over the trees) of $n_t/\leb(\mathcal{X}_t)$, where $t$ is the leaf node (of the tree considered) containing $x$.
% % \begin{algorithm}[OneClassRF.fit]~\\
% % \begin{pythoncode}
% % Inputs: X
% % Parameters: max_samples, max_features, n_estimators, max_depth
% % Output: A set of n_estimators trees.
% % \end{pythoncode}
% % \end{algorithm}
% \begin{remark}({\sc Extremely Randomized Trees})
% In the case of extremely randomized trees (see \cite{Geurts2006}), namely trees grown totally randomly as used in Isolation Forest (\cite{Liu2008}), no splitting criterion is needed so that no impurity function is used. The one class version of the majority vote can still be used in this case, which yields a scoring function of the form $s(x) = \sum_t \mathds{1}_{\{\Omega_t \}} \frac{N_t}{\leb(\Omega_t)}$. XXX benchmark on this algo? (easy: iforest with score from OCSVM)
% \end{remark}
%XXX sklearn exemple with OneClassRF with different alpha
\section{Benchmarks}
%
\label{ocrf:sec:benchmark}
\begin{table}[ht]
\caption{Original Datasets characteristics}
\label{ocrf:table:data}
\centering
\footnotesize
%\tabcolsep=0.08cm
\resizebox{0.9\linewidth}{!} {
\begin{tabular}{lccll}
\toprule
Dataset & nb of samples & nb of features & anomaly class & anomaly ratio \\ \midrule
adult & 48842 & 6 & class '$>50K$' & (23.9\%) \\
annthyroid & 7200 & 6 & classes $\neq$ 3 & (7.42\%) \\
arrhythmia & 452 & 164 & classes $\neq$ 1 (features 10-14 removed)& (45.8\%) \\
forestcover & 286048 & 10 & class 4 (vs. class 2 ) & (0.96\%) \\
http & 567498 & 3 & attack & (0.39\%) \\
ionosphere & 351 & 32 & bad & (35.9\%) \\
pendigits & 10992 & 16 & class 4 & (10.4\%) \\
pima & 768 & 8 & pos (class 1) & (34.9\%) \\
shuttle & 85849 & 9 & classes $\neq$ 1 (class 4 removed) & (7.17\%) \\
smtp & 95156 & 3 & attack & (0.03\%) \\
spambase & 4601 & 57 & spam & (39.4\%) \\
wilt & 4839 & 5 & class 'w' (diseased trees) & (5.39\%) \\
\bottomrule
\end{tabular}
}
\end{table}
%
\begin{table}[!h]
\caption{Results for the novelty detection setting (semi-supervised framework).
%We compare various methods from the state-of-the-art (top line) to OneClassRF over different classical datasets (leftmost column).
The table reports AUC ROC and AUC PR scores (higher is better) for each algorithm.
%We use the set of hyper-parameters recommended in the corresponding reference papers.
The training time of each algorithm has been limited to 30 minutes for each of the 10 experiments performed on every dataset;
`NA' indicates that the algorithm could not finish training within this time limit.
On average over all the datasets, our proposed algorithm `OneClassRF' achieves the best AUC ROC score (tied with LSAD) and the best AUC PR score. It also achieves the lowest cumulative training time.}
\label{ocrf:table:results-semisupervised}
\centering
%\scriptsize
\tabcolsep=0.1cm
\resizebox{\linewidth}{!} {
\begin{tabular}{ l c@{\extracolsep{0.1cm}}c c@{\extracolsep{0.1cm}}c c@{\extracolsep{0.1cm}}c c@{\extracolsep{0.1cm}}c c@{\extracolsep{0.1cm}}c c@{\extracolsep{0.1cm}}c c@{\extracolsep{0.1cm}}c c@{\extracolsep{0.1cm}}c }
\toprule
%
Dataset & \multicolumn{2}{c }{OneClassRF} & \multicolumn{2}{c }{iForest} & \multicolumn{2}{c }{OCRF\small{sampl.}} & \multicolumn{2}{c }{OCSVM}& \multicolumn{2}{c }{LOF}& \multicolumn{2}{c }{Orca}& \multicolumn{2}{c }{LSAD}& \multicolumn{2}{c }{RFC} \\%& parameters $(\epsilon, k)$\\
\cmidrule{1-17}
~ & ROC & PR & ROC & PR & ROC & PR & ROC & PR & ROC & PR &ROC & PR & ROC & PR & ROC & PR \\
adult & \textbf{0.665} & \textbf{0.278} & 0.661 & 0.227 & NA & NA & 0.638 & 0.201 & 0.615 & 0.188 & 0.606 & 0.218 & 0.647 & 0.258 & NA & NA \\
annthyroid & \textbf{0.936} & 0.468 & 0.913 & 0.456 & 0.918 & \textbf{0.532} & 0.706 & 0.242 & 0.832 & 0.446 & 0.587 & 0.181 & 0.810 & 0.327 & NA & NA \\
arrhythmia & 0.684 & 0.510 & 0.763 & 0.492 & 0.639 & 0.249 & \textbf{0.922} & \textbf{0.639} & 0.761 & 0.473 & 0.720 & 0.466 & 0.778 & 0.514 & 0.716 & 0.299 \\
forestcover & 0.968 & 0.457 & 0.863 & 0.046 & NA & NA & NA & NA & \textbf{0.990} & \textbf{0.795} & 0.946 & 0.558 & 0.952 & 0.166 & NA & NA \\
http & \textbf{0.999} & \textbf{0.838} & 0.994 & 0.197 & NA & NA & NA & NA & NA & NA & \textbf{0.999} & 0.812 & 0.981 & 0.537 & NA & NA \\
ionosphere & 0.909 & 0.643 & 0.902 & 0.535 & 0.859 & 0.609 & 0.973 & 0.849 & 0.959 & 0.807 & 0.928 & \textbf{0.910} & \textbf{0.978} & 0.893 & 0.950 & 0.754 \\
pendigits & 0.960 & 0.559 & 0.810 & 0.197 & 0.968 & 0.694 & 0.603 & 0.110 & 0.983 & 0.827 & \textbf{0.993} & \textbf{0.925} & 0.983 & 0.752 & NA & NA \\
pima & 0.719 & 0.247 & 0.726 & 0.183 & \textbf{0.759} & \textbf{0.266} & 0.716 & 0.237 & 0.700 & 0.152 & 0.588 & 0.175 & 0.713 & 0.216 & 0.506 & 0.090 \\
shuttle & \textbf{0.999} & \textbf{0.998} & 0.996 & 0.973 & NA & NA & 0.992 & 0.924 & \textbf{0.999} & 0.995 & 0.890 & 0.782 & 0.996 & 0.956 & NA & NA \\
smtp & 0.922 & 0.499 & 0.907 & 0.005 & NA & NA & 0.881 & \textbf{0.656} & \textbf{0.924} & 0.149 & 0.782 & 0.142 & 0.877 & 0.381 & NA & NA \\
spambase & \textbf{0.850} & 0.373 & 0.824 & 0.372 & 0.797 & \textbf{0.485} & 0.737 & 0.208 & 0.746 & 0.160 & 0.631 & 0.252 & 0.806 & 0.330 & 0.723 & 0.151 \\
wilt & 0.593 & 0.070 & 0.491 & 0.045 & 0.442 & 0.038 & 0.323 & 0.036 & 0.697 & 0.092 & 0.441 & 0.030 & 0.677 & 0.074 & \textbf{0.896} & \textbf{0.631} \\
%internet_ads & & & & & & & & & &\\
\cmidrule{1-17}
average: & \textbf{0.850} & \textbf{0.495} & 0.821 & 0.311 & 0.769 & 0.410 & 0.749 & 0.410 & 0.837 & 0.462 & 0.759 & 0.454 & \textbf{0.850} & 0.450 & 0.758 & 0.385 \\
cum. train time: & \multicolumn{2}{c }{\textbf{61s}} & \multicolumn{2}{c }{68s} & \multicolumn{2}{c }{NA} & \multicolumn{2}{c }{NA}& \multicolumn{2}{c }{NA}& \multicolumn{2}{c }{2232s}& \multicolumn{2}{c }{73s}& \multicolumn{2}{c }{NA} \\
\bottomrule
\end{tabular}
}
\end{table}
In this section, we compare the OneClassRF algorithm described above to seven state-of-the-art anomaly detection algorithms:
the isolation forest algorithm \citep{Liu2008} (iForest), a one-class RF algorithm based on sampling a second class \citep{Desir12} (OCRFsampling), one-class SVM \citep{Scholkopf2001} (OCSVM), local outlier factor \citep{Breunig2000LOF} (LOF), Orca \citep{Bay2003}, Least Squares Anomaly Detection \citep{Quinn2014} (LSAD), % EvOutSe (Evolutionary Outlier Search \cite{Aggarwal2001})
and Random Forest Clustering \citep{Shi2012} (RFC).
%We have used default parameters for these algorithms, as detailed in supplementary material.
\subsection{Default parameters of OneClassRF}
The default parameters of our algorithm are the following.
%
$max\_samples$ is fixed to $20\%$ of the training sample size (with a minimum of $100$); $max\_features\_tree$ is fixed to $50\%$ of the total number of features with a minimum of $5$ (\ie~each tree is built on $50\%$ of the total number of features); $max\_features\_node$ is fixed to $5$; $\gamma$ is fixed to $1$ (see Remark~\ref{ocrf:rk:gamma}); $max\_depth$ is fixed to $\log_2$ (logarithm in base $2$) of the training sample size as in \cite{Liu2008}; $n\_trees$ is fixed to $100$ as in the previous reference; and parameter $s_i$ is set to $s_3$ as defined in \eqref{ocrf:eq:scoring3}.
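For concreteness, these defaults can be summarized as follows; the function and keyword names are purely illustrative (they do not refer to a published implementation), and the rounding of $max\_depth$ is our own choice.
\begin{pythoncode}
# Illustrative summary of the default parameters listed above; the names
# are hypothetical and the rounding of max_depth is an assumption.
import numpy as np

def default_one_class_rf_params(n_samples, n_features):
    return dict(
        n_trees=100,                                         # as in iForest
        max_samples=max(100, int(0.2 * n_samples)),          # 20% of the training size, min 100
        max_features_tree=max(5, int(0.5 * n_features)),     # 50% of the features, min 5
        max_features_node=5,
        gamma=1.0,                                           # see Remark on gamma
        max_depth=int(np.ceil(np.log2(max(2, n_samples)))),  # log2 of the training size
        scoring='s3',                                        # scoring function s_3
    )

print(default_one_class_rf_params(n_samples=10992, n_features=16))  # e.g. pendigits
\end{pythoncode}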
\subsection{Hyper-Parameters of tested algorithms}
Overall, we trained the different algorithms with the default hyper-parameters given in their respective papers or in the authors' implementations.
The \emph{OCSVM} algorithm uses default parameters: \verb+kernel='rbf'+, \verb+tol=1e-3+,
\verb+nu=0.5+, \verb+shrinking=True+, \verb+gamma=1/n_features+, where \verb+tol+ is the tolerance of the stopping criterion.
The \emph{LOF} algorithm uses default parameters:
\verb+n_neighbors=5+, \verb+leaf_size=30+, \verb+metric='minkowski'+,
\verb+contamination=0.1+, \verb+algorithm='auto'+, where the \verb+algorithm+ parameter stipulates
how the nearest neighbors are computed (ball tree, k-d tree or brute force).
The \emph{iForest} algorithm uses default parameters:
\verb+n_estimators=100+, \verb+max_samples=min(256, n_samples)+,
\verb+max_features=1+, \verb+bootstrap=False+, where \verb+bootstrap+ states whether samples are drawn with replacement.
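For the three algorithms above, which are available in the \emph{scikit-learn} library \citep{sklearn2011}, these defaults roughly correspond to the following instantiations; the exact argument names and default values may differ slightly across library versions (for instance, \verb+gamma='auto'+ stands for $1/n\_features$).
\begin{pythoncode}
# Sketch of scikit-learn instantiations matching the defaults listed above;
# exact argument names/defaults may vary between scikit-learn versions.
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

ocsvm = OneClassSVM(kernel='rbf', tol=1e-3, nu=0.5, shrinking=True,
                    gamma='auto')                    # 'auto' means 1 / n_features
lof = LocalOutlierFactor(n_neighbors=5, leaf_size=30, metric='minkowski',
                         contamination=0.1, algorithm='auto')
iforest = IsolationForest(n_estimators=100, max_samples='auto',  # min(256, n_samples)
                          max_features=1.0, bootstrap=False)
\end{pythoncode}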
The \emph{OCRFsampling} algorithm uses default parameters:
the number of dimensions for the Random Subspace Method \verb+krsm=-1+,
the number of features randomly selected at each node during the induction of the tree \verb+krfs=-1+,
\verb+n_tree=100+,
the factor controlling the extension of the outlier domain used for outlier generation according to the volume of the hyper-box surrounding the target data \verb+alpha=1.2+,
the factor controlling the number of outlier data generated according to the number of target data \verb+beta=10+,
whether outliers are generated from a uniform distribution \verb+optimize=0+,
whether data outside the target bounds are considered as outliers \verb+rejectOutOfBounds=0+.
The \emph{Orca} algorithm uses default parameter \verb+k=5+ (number of nearest neighbors)
as well as \verb+N=n/8+ (how many anomalies are to be reported).
The latter setting, used in the empirical evaluation of iForest in \cite{Liu2012},
reduces the computation time without impacting Orca's performance.
The \emph{RFC} algorithm uses default parameters:
\verb+no.forests=25+, \verb+no.trees=3000+,
the Addcl1 Random Forest dissimilarity \verb+addcl1=T, addcl2=F+,
the use of the importance measure \verb+imp=T+,
the data generating process \verb+oob.prox1=T+,
the number of features sampled at each split \verb+mtry1=3+.
The \emph{LSAD} algorithm uses default parameters:
the maximum number of samples per kernel \verb+n_kernels_max=500+,
the center of each kernel (the center of the random sample subset by default) \verb+kernel_pos='None'+,
the kernel scale parameter (using the pairwise median trick by default) \verb+gamma='None'+,
the regularization parameter \verb+rho=0.1+.
\subsection{Description of the datasets}
%
The characteristics of the twelve reference datasets considered here are summarized
in Table~\ref{ocrf:table:data}. They are all available on the UCI repository
\citep{Lichman2013} and the preprocessing is done in a classical way. %, excepting for the \emph{adult} dataset. For the latter, we considered the 6 continuous features.
We removed all non-continuous attributes as well as attributes taking less than $10$ different values.
%
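For reference, this generic preprocessing can be sketched as follows (using \emph{pandas}); the column handling is illustrative, and dataset-specific steps, such as the removal of some features of \emph{arrhythmia}, are omitted.
\begin{pythoncode}
# Hedged sketch of the generic preprocessing described above: keep only
# the numeric attributes taking at least 10 distinct values.
import pandas as pd

def preprocess(df, label_col):
    features = df.drop(columns=[label_col])
    numeric = features.select_dtypes(include='number')   # drop non-continuous attributes
    kept = [c for c in numeric.columns if numeric[c].nunique() >= 10]
    return pd.concat([numeric[kept], df[label_col]], axis=1)
\end{pythoncode}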
The \emph{http} and \emph{smtp} datasets belong to the KDD Cup '99 dataset \citep{KDD99, Tavallaee2009}, which consists of a wide variety of hand-injected attacks (anomalies) in a closed network (normal background). They are classically obtained as described in \cite{Yamanishi2000}. These two datasets are available in the \emph{scikit-learn} library \citep{sklearn2011}.
The \emph{shuttle} dataset is the fusion of the training and testing datasets available in the UCI repository. As in \cite{Liu2008}, we use instances from all different classes but class $4$. %, which yields an anomaly ratio (class 1) of $7.17\%$.
In the \emph{forestcover} data, the normal data are the instances from class~$2$ while instances from class $4$ are anomalies (as in \cite{Liu2008}). %, which yields an anomaly ratio of $0.9\%$.
The \emph{ionosphere} dataset differentiates `good' from `bad' radars, considered here as abnormal. A `good' radar shows evidence of some type of structure in the ionosphere. A `bad' radar does not, its signal passing through the ionosphere.
The \emph{spambase} dataset consists of spam or non-spam emails. The former constitute our anomaly class.
The \emph{annthyroid} medical dataset on hypothyroidism contains one normal class and two abnormal ones, which form our outliers.
The \emph{arrhythmia} dataset reflects the presence and absence (class $1$) of cardiac arrhythmia. The number of attributes being large considering the sample size, we removed attributes containing missing data.
The \emph{pendigits} dataset contains 10 classes corresponding to the digits from 0 to 9, examples being handwriting samples. As in \cite{Schubert2012}, the abnormal data are chosen to be those from class 4.
The \emph{pima} dataset consists of medical data on diabetes. Patients suffering from diabetes (positive class) were considered outliers.
The \emph{wilt} dataset involves detecting diseased trees in Quickbird imagery. Diseased trees (class `w') constitute our abnormal class.
In the \emph{adult} dataset, the goal is to predict whether income exceeds \$ 50K/year based on census data. We only keep the 6 continuous attributes.
%Concerning the \emph{internet_ads} dataset, it contains characteristic of images, which may (class `ad', outlier) or not (class `nonad') be an advertisement.
\subsection{Results}
\pgfplotsset{minor grid style={very thick,black}}
\begin{figure}[ht]
\definecolor{ggreen}{rgb}{0.3,0.7,0.4}
\definecolor{ggreen2}{rgb}{0.4,0.8,0.5}
\definecolor{orange2}{rgb}{1,0.7,0}
\definecolor{violette}{rgb}{0.7,0.15,0.9}
\begin{tikzpicture}[scale=0.6,font=\Large]
\begin{axis}[ at={(0,0)},
grid=minor,
width=23cm, height=6cm,
ybar=0pt,
minor xtick={0.5,1.5,...,12.5},
xmin=0.5, xmax=12.5,
xticklabels={ , , , , , , , , , , , },
ymin=0.58,
ytick={0.6,0.7,...,1},
ymax=1,
ylabel={ROC AUC},