\ifx\wholebook\relax\else
\documentclass[twoside]{book}
\usepackage[active]{srcltx}
\usepackage[LY1]{fontenc}
\input{DhbDefinitions}
\begin{document}
\fi
\chapter{Elements of statistics}
\label{ch:statistics} \vspace{1 ex}
\begin{flushright} {\sl La statistique est la premi\`ere des sciences inexactes.}\footnote{Statistics is the first of the inexact sciences.}\\
Edmond et Jules de Goncourt
\end{flushright}
\vspace{1 ex} Statistical analysis comes into play when dealing
with a large amount of data. Obtaining information from the
statistical analysis of data is the subject of chapter
\ref{ch:estimation}. Some sections of chapter \ref{ch:datamining}
also use statistics. The concepts needed by statistics are based
on probability theory.
This chapter gives a quick overview of the concepts of probability
theory. It is the third (and last) chapter of this book where most
of the material is not useful {\it per se}. Figure
\ref{fig:statisticsclasses} shows the classes described in this
chapter.
\begin{figure}
\centering\includegraphics[width=11cm]{Figures/StatisticsClasses}
\caption{Classes related to statistics}
\label{fig:statisticsclasses}
\end{figure}
All these classes, however, are used extensively in the remaining
chapters of this book. Examples of how to use the code are kept
to a minimum since real examples of use can be found in the next
chapters.
An in-depth description of probability theory is beyond the scope
of this book. The reader in need of additional information should
consult one of the numerous textbooks on the subject, \cite{PhiTay} or
\cite{LawKel} for example.
\section{Statistical moments}
\label{sec:moments} When one measures the values of an observable
random variable, each measurement gives a different magnitude.
Assuming measurement errors are negligible, the fluctuations of the
measured values correspond to the distribution of the random
variable. The problem to be solved by the experimenter is to
determine the parameters of the distribution from the observed
values. {\sl Statistical moments} can contribute to the
characterization of the distribution\footnote{Central moments are
related to the coefficients of the Taylor expansion of the Fourier
transform of the distribution function.}.
Given a set of measurements, $x_1,\ldots,x_n$, of the values
measured for a random variable one defines the moment of $k\th$
order by:
\begin{equation}
M_k={1\over n}\sum_{i=1}^n x_i^k.
\end{equation}
In particular, the moment of first order is the mean or average of
the set of data:
\begin{equation}
\label{eq:average}
\bar{x}=M_1={1\over n}\sum_{i=1}^n x_i.
\end{equation}
The central moment of $k^{\mathop{\rm th}}$ order is defined by:
\begin{equation}
m_k={1\over n}\sum_{i=1}^n \left(x_i-\bar{x}\right)^k,
\end{equation}
where $k$ is larger than 1. The central moments are easily
expressed in terms of the moments. We have:
\begin{equation}
m_k=\sum_{j=0}^k \left({k \atop j}\right)\left(-\bar{x}\right)^{k-j}M_j,
\end{equation}
where $\left({k \atop j}\right)$ are the binomial coefficients.
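To make this conversion concrete, here is a short sketch in Python (an illustrative translation; the book's code is in Smalltalk) that computes the raw moments of a data set and then derives the central moments through the binomial expansion above:

```python
from math import comb

def central_moments(data, highest_order=4):
    """Compute raw moments M_k, then the central moments m_k via the
    binomial expansion m_k = sum_j C(k,j) (-mean)^(k-j) M_j."""
    n = len(data)
    raw = [sum(x ** k for x in data) / n for k in range(highest_order + 1)]
    mean = raw[1]
    central = [
        sum(comb(k, j) * (-mean) ** (k - j) * raw[j] for j in range(k + 1))
        for k in range(highest_order + 1)
    ]
    return mean, central   # central[0] is 1, central[1] is 0 (up to rounding)
```

A quick sanity check: $m_0=1$ and $m_1=0$ always hold, which follows directly from the expansion.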
Some statistical parameters are defined on the central moments.
The variance of a set of measurements is the central moment of
second order. The standard deviation, $s$, is the square root of
the variance given by the following formula:
\begin{equation}
s^2={\displaystyle n \over \displaystyle n-1}m_2 ={\displaystyle 1
\over \displaystyle n-1} \sum_{i=1}^n \left(x_i-\bar{x}\right)^2.
\end{equation}
The factor in front of the central moment of second order is called
Bessel's correction factor. This factor removes the bias of the
estimation when the standard deviation is evaluated over a finite
sample. The standard deviation measures the spread of the data
around the average.
Many people believe that the standard deviation is the error of
the average. This is not true: the standard deviation describes
how much the data are spread around the average. It thus
represents the error of a single measurement. An estimation of the
standard deviation of the average value is given by the following
formula:
\begin{equation}
\label{eq:averagerror}
s_{\bar{x}}^2={\displaystyle s^2 \over n}
\mbox{\quad or \quad}
s_{\bar{x}}={\displaystyle s \over \sqrt{n}}.
\end{equation}
This expression must be taken as the error on the average when
performing a least-squares fit on averaged data, for example.
Two quantities are related to the central moments of $3\st{rd}$
and $4\th$ order. Each of these quantities is normalized by the
adequate power of the standard deviation needed to yield a
quantity without dimension.
\noindent The skewness is defined by:
\begin{equation}
a={\displaystyle n \over\displaystyle
\left(n-1\right)\left(n-2\right)s^3}m_3
={\displaystyle 1 \over\displaystyle
\left(n-1\right)\left(n-2\right)}
\sum_{i=1}^n\left({\displaystyle x_i-\bar{x}\over\displaystyle s}\right)^3.
\end{equation}
The skewness is a measure of the asymmetry of a distribution. If
the skewness is positive, the observed distribution is asymmetric
toward large values and vice-versa.
\noindent The kurtosis is defined by
\begin{equation}
\label{eq:kurtosis}
\begin{array}{rl}
k=& {\displaystyle n\left(n+1\right) \over\displaystyle
\left(n-1\right)\left(n-2\right)\left(n-3\right)s^4}m_4 -{\displaystyle 3\left(n-1\right)^2 \over\displaystyle
\left(n-2\right)\left(n-3\right)}\\*[2ex]
=& {\displaystyle \left(n+1\right) \over\displaystyle
\left(n-1\right)\left(n-2\right)\left(n-3\right)}
\displaystyle\sum_{i=1}^n\left({\displaystyle x_i-\bar{x}\over\displaystyle s}\right)^4 -{\displaystyle 3\left(n-1\right)^2 \over\displaystyle
\left(n-2\right)\left(n-3\right)}
\end{array}
\end{equation}
The kurtosis is a measure of the peakedness or flatness of a
distribution in the region of the average. The subtracted term in
equation \ref{eq:kurtosis} is a convention defining the kurtosis
of the normal distribution as 0\footnote{One talks about a {\sl
platykurtic} distribution when the kurtosis is negative, that is
the peak of the distribution is flatter than that of the normal
distribution. Student (\cf section \ref{sec:ttest}) and Cauchy
(\cf section \ref{sec:cauchydist}) distributions are platykurtic.
The opposite is called {\sl leptokurtic}. The Laplace (\cf section
\ref{sec:laplacedist}) distribution is leptokurtic. }.
As we have seen, the average, standard deviation, skewness and
kurtosis are parameters that help characterize a distribution
of observed values. To keep track of these parameters, it is handy
to define an object whose responsibility is to accumulate the
moments up to order 4. One can then easily compute the parameters
of the distribution. It can be used in all cases where
distribution parameters are needed.
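Such an accumulator can be sketched as follows in Python (an illustrative translation, not the book's Smalltalk implementation). It stores the sums of powers of the accumulated values and derives the parameters using the corrected formulas displayed in this section:

```python
from math import comb

class MomentAccumulator:
    """Accumulate raw moments up to order 4 and derive the
    distribution parameters with the corrections given in the text."""

    def __init__(self):
        self.n = 0
        self.sums = [0.0] * 5          # sums of x^0 ... x^4

    def accumulate(self, x):
        self.n += 1
        p = 1.0
        for k in range(5):
            self.sums[k] += p
            p *= x

    def central_moment(self, k):
        # m_k from the raw moments via the binomial expansion
        M = [s / self.n for s in self.sums]
        return sum(comb(k, j) * (-M[1]) ** (k - j) * M[j]
                   for j in range(k + 1))

    def average(self):
        return self.sums[1] / self.n

    def variance(self):
        n = self.n
        return n / (n - 1) * self.central_moment(2)   # Bessel's correction

    def standard_deviation(self):
        return self.variance() ** 0.5

    def error_on_average(self):
        return self.standard_deviation() / self.n ** 0.5

    def skewness(self):
        n, s = self.n, self.standard_deviation()
        return n * self.central_moment(3) / ((n - 1) * (n - 2) * s ** 3)

    def kurtosis(self):
        n, s = self.n, self.standard_deviation()
        return (n * (n + 1) * self.central_moment(4)
                / ((n - 1) * (n - 2) * (n - 3) * s ** 4)
                - 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
```

The skewness and kurtosis methods implement the equations exactly as displayed above, including the sample-size corrections.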
\subsection{Statistical moments --- Smalltalk implementation}
\marginpar{Figure \ref{fig:statisticsclasses} with the box {\bf
FastStatisticalMoments} grayed.} To describe this implementation
we must anticipate the next section: the class {\bf
FastStatisticalMoments} implementing statistical moments as
described in Section \ref{sec:moments} is a subclass of the class
defined in section \ref{sec:robustmoment}.
Space allocation is handled by the superclass. The class {\tt
FastStatisticalMoments} uses this allocation to store the moments
(instead of the central moments). The method {\tt accumulate:}
performs the accumulation of the moments. The methods {\tt
average}, {\tt variance}, {\tt skewness} and {\tt kurtosis}
compute the respective quantities using explicit expansion of the
central moments as a function of the moments.
The computation of the standard deviation and of the error on the
average are handled by the superclass (\cf listing
\ref{ls:genmoments}).
\label{sec:smoments}Listing \ref{ls:fastmoments} shows the
Smalltalk implementation. The class {\tt
DhbFastStatisticalMoments} is a subclass of class {\tt
DhbStatisticalMoments} presented in listing \ref{ls:genmoments} of
section \ref{sec:srobustmoment}. The reason for the split into two
classes will become clear in section \ref{sec:robustmoment}.
The following code shows how to use the class {\tt
DhbFastStatisticalMoments} to accumulate measurements of a random
variable and to extract the various distribution parameters
discussed in section \ref{sec:moments}.
\begin{codeExample}
\label{ex:smoments}
\begin{verbatim}
| accumulator valueStream average stdev skewness kurtosis |
accumulator := DhbFastStatisticalMoments new.
[ valueStream atEnd ]
whileFalse: [ accumulator accumulate: valueStream next ].
average := accumulator average.
stdev := accumulator standardDeviation.
skewness := accumulator skewness.
kurtosis := accumulator kurtosis.
\end{verbatim}
\end{codeExample}
This example assumes that the measurements of the random variable
are obtained from a stream. The exact implementation of the stream
is not shown here.
After the declarative statements, the first executable statement
creates a new instance of the class {\tt
DhbFastStatisticalMoments} with the default dimension. This
default allocates enough storage to accumulate up to the moment of
$4\th$ order. The next two lines are the accumulation proper using
a {\tt whileFalse:} construct and the general behavior of a
stream. The last four lines extract the main parameters of the
distribution.
If any of the distribution's parameters --- average, variance,
skewness or kurtosis --- cannot be computed, the returned value is
{\tt nil}.
\begin{listing} Smalltalk fast implementation of statistical moments \label{ls:fastmoments}
\input{Smalltalk/Statistics/DhbFastStatisticalMoments}
\end{listing}
\section{Robust implementation of statistical moments}
\label{sec:robustmoment} The method used to implement the
computation of the central moments in the previous section is
prone to rounding errors. Indeed, contributions from values distant
from the average can completely distort a result, however infrequent
they are. Such an effect is worse when the central moments are
derived from the moments. This section gives an algorithm ensuring
minimal rounding errors.
The definition of statistical moments is based on the concept of
expectation value. The expectation value is a linear operator over
all functions of the random variable. If one measures the values
of the random variable $n$ times, the expectation value of a
function $f\left(x\right)$ of a random variable $x$ is estimated
by the following expression:
\begin{equation}
\label{eq:expectation}
\left\langle f\left(x\right)\right\rangle_n
= {1\over n}\sum_{i=1}^n f\left(x_i\right),
\end{equation}
where the values $x_1,\ldots,x_n$ are the measurements of the
random variable. A comparison of equation \ref{eq:expectation}
with \ref{eq:average} shows that the average is simply the
expectation value of the function $f\left(x\right)=x$. The central
moment of order $k$ is the expectation value of the function
$\left(x-\bar{x}\right)^k$:
\begin{equation}
\label{eq:expectmoment}
\left\langle \left(x-\bar{x}\right)^k\right\rangle_n
= {1\over n}\sum_{i=1}^n \left(x_i-\bar{x}\right)^k.
\end{equation}
To minimize rounding errors, one computes the changes occurring to
the central moments when a new value is taken into account. In
other words, one computes the value of a central moment over $n+1$
values as a function of the central moment over $n$ values and the
$\left(n+1\right)\th$ value. For the average, we have
\begin{equation}
\label{eq:accumaverage}
\begin{array}{ll}
\left\langle x\right\rangle_{n+1}
&= \displaystyle{1\over n+1}\sum_{i=1}^{n+1} x_i\\
&=\displaystyle {x_{n+1}\over n+1}+{1\over n+1}\sum_{i=1}^n x_i\\
&=\displaystyle {x_{n+1}\over n+1}+{n\over n+1}\left\langle x\right\rangle_n\\
&=\displaystyle {x_{n+1}\over n+1}+\left(1-{1\over n+1}\right)\left\langle x\right\rangle_n\\
&=\displaystyle\left\langle x\right\rangle_n- {\left\langle x\right\rangle_n-x_{n+1}\over
n+1}.
\end{array}
\end{equation}
Thus, the estimator of the average over $n+1$ measurements can be
computed from the estimator of the average over $n$ measurements
by subtracting a small correction, $\Delta_{n+1}$, given by:
\begin{mainEquation}
\label{eq:deltaverage}
\begin{array}{ll}
\Delta_{n+1}&=\displaystyle\left\langle x\right\rangle_n - \left\langle
x\right\rangle_{n+1}\\
&=\displaystyle{\left\langle x\right\rangle_n-x_{n+1}\over
n+1}.
\end{array}
\end{mainEquation}
The expression in the numerator of equation \ref{eq:deltaverage}
subtracts two quantities of comparable magnitude. This ensures a
minimization of the rounding errors.
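In Python (an illustrative sketch, not the book's code), the incremental update of the average reads:

```python
def incremental_mean(values):
    """Update the average one value at a time using the correction
    Delta_{n+1} = (<x>_n - x_{n+1}) / (n + 1); the numerator subtracts
    two quantities of comparable magnitude, limiting rounding errors."""
    mean = 0.0
    for n, x in enumerate(values):   # n values already accumulated
        mean -= (mean - x) / (n + 1)
    return mean
```

Note that when the running average is already close to the incoming values, the correction is tiny, so no large intermediate quantities ever appear.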
A similar derivation can be made for the central moments of higher
orders. A complete derivation is given in appendix
\ref{sec:centralmoments}. The final expression is
\begin{mainEquation}
\label{eq:deltamoment}
\left\langle\left(x-\bar{x}\right)^k\right\rangle_{n+1}
={n\over n+1}\left\{
\left[ 1 - \left(-n\right)^{k-1}\right]\Delta_{n+1}^k
+\sum_{l=2}^k \left({k\atop l}\right)
\left\langle\left(x-\bar{x}\right)^l\right\rangle_n
\Delta_{n+1}^{k-l}
\right\}.
\end{mainEquation}
The reader can verify the validity of equation
\ref{eq:deltamoment} by checking that it gives 1 for $k=0$ and 0
for $k=1$. Put in this form, the computation of the central moment
estimators indeed minimizes rounding errors. For the central
moment of order 2 we have:
\begin{mainEquation}
\label{eq:accumvariance}
\left\langle\left(x-\bar{x}\right)^2\right\rangle_{n+1}
={n\over n+1}\left\{
\left(1+n\right)\Delta_{n+1}^2
+\left\langle\left(x-\bar{x}\right)^2\right\rangle_n
\right\}.
\end{mainEquation}
For the central moment of order 3 we have:
\begin{mainEquation}
\label{eq:accumskewness}
\left\langle\left(x-\bar{x}\right)^3\right\rangle_{n+1}
={n\over n+1}\left\{
\left(1-n^2\right)\Delta_{n+1}^3
+3\left\langle\left(x-\bar{x}\right)^2\right\rangle_n\Delta_{n+1}
+\left\langle\left(x-\bar{x}\right)^3\right\rangle_n
\right\}.
\end{mainEquation}
For the central moment of order 4 we have:
\begin{mainEquation}
\label{eq:accumkurtosis}
\begin{array}{ll}
\left\langle\left(x-\bar{x}\right)^4\right\rangle_{n+1}
=\displaystyle{n\over n+1}&\left\{
\left(1+n^3\right)\Delta_{n+1}^4
+6\left\langle\left(x-\bar{x}\right)^2\right\rangle_n\Delta_{n+1}^2\right.\\
&\left.+4\left\langle\left(x-\bar{x}\right)^3\right\rangle_n\Delta_{n+1}
+\left\langle\left(x-\bar{x}\right)^4\right\rangle_n
\right\}.
\end{array}
\end{mainEquation}
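These three update equations translate directly into an accumulator. The following Python sketch is illustrative only (the book's implementation is the Smalltalk class {\tt DhbFixedStatisticalMoments}); it updates the highest order first so that each formula uses the previous values of the lower-order moments:

```python
class CentralMomentAccumulator:
    """Accumulate <(x - mean)^k> for k = 2, 3, 4 incrementally,
    following the order-2, -3 and -4 update equations above."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = self.m3 = self.m4 = 0.0

    def accumulate(self, x):
        n = self.n
        d = (self.mean - x) / (n + 1)   # the correction Delta_{n+1}
        f = n / (n + 1)
        # highest order first: each update must see the *previous*
        # values of the lower-order moments
        self.m4 = f * ((1 + n ** 3) * d ** 4 + 6 * self.m2 * d ** 2
                       + 4 * self.m3 * d + self.m4)
        self.m3 = f * ((1 - n ** 2) * d ** 3 + 3 * self.m2 * d + self.m3)
        self.m2 = f * ((1 + n) * d ** 2 + self.m2)
        self.mean -= d
        self.n = n + 1
```

After each call the estimated central moments are immediately available, which is the advantage of this accumulation scheme noted below.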
\subsection{Robust central moments --- General implementation}
\marginpar{Figure \ref{fig:statisticsclasses} with the boxes {\bf
StatisticalMoments} and {\bf FixedStatisticalMoments} grayed.} The
class {\tt StatisticalMoments} has a single instance variable {\tt
moments} used to store the accumulated central moments.
The evaluation of equation \ref{eq:deltamoment} is not as hard as
it seems from a programming point of view. One must remember that
the binomial coefficients can be obtained by recursion (Pascal
triangle). Furthermore, the terms of the sum can be computed
recursively from those of the previous order so that raising the
correction $\Delta_{n+1}$ to an integer power is never done
explicitly. Equation \ref{eq:deltamoment} is implemented in method
{\tt accumulate}. The reader will notice that the binomial
coefficients are computed inside the loop computing the sum.
Accumulating the central moments using equation
\ref{eq:deltamoment} has the advantage that the estimated value of
the central moment is always available. Nevertheless, accumulation
is about two times slower than with the brute-force method exposed
in section \ref{sec:moments}. The reader must weigh speed against
accuracy when choosing between the two implementations.
The class {\tt FixedStatisticalMoments} is a subclass of class
{\tt StatisticalMoments} specialized in the accumulation of
central moments up to order 4. Instead of implementing the general
equation \ref{eq:deltamoment}, the central moments are accumulated
using equations \ref{eq:accumvariance}, \ref{eq:accumskewness} and
\ref{eq:accumkurtosis}. The only instance method redefined by this
class is the method {\tt accumulate}. All other computations are
performed using the methods of the superclass.
\subsection{Robust central moments --- Smalltalk implementation}
\label{sec:srobustmoment} Listing
\ref{ls:genmoments} shows the implementation of the robust
statistical moments. Listing \ref{ls:fixedmoments} shows a
specialization to optimize the speed of accumulation for the most
frequently used case (accumulation up to the $4\th$ order).
Using the class is identical for all classes of the hierarchy.
Thus, the code example presented in section \ref{sec:smoments} is
also valid for these two classes.
The creation method {\tt new:} takes as argument the highest order
of the accumulated moments. The corresponding initialization
method allocates the required storage. The creation method {\tt
new} corresponds to the most frequent usage: the highest order is
4.
The methods computing the distribution parameters --- average,
variance, skewness and kurtosis --- are using the method {\tt
centralMoment:} retrieving the central moment of a given order.
They will return {\tt nil} if not enough data has been accumulated
in the moments.
\begin{listing} Smalltalk implementation of accurate statistical moments \label{ls:genmoments}
\input{Smalltalk/Statistics/DhbStatisticalMoments}
\end{listing}
The class {\tt DhbFixedStatisticalMoments} is a specialization of
the class {\tt DhbStatisticalMoments} for a fixed number of
central moments going up to the $4\th$ order.
The class creation method {\tt new:} is barred from usage as the
class can only be used for a fixed number of moment orders. As a
consequence, the default creation method must be redefined to
delegate the parametric creation to the method of the superclass.
\begin{listing} Smalltalk implementation of accurate statistical moments with fixed orders \label{ls:fixedmoments}
\input{Smalltalk/Statistics/DhbFixedStatisticalMoments}
\end{listing}
\section{Histograms}
\label{sec:histogram} Whereas statistical moments provide a quick
way of obtaining information about the distribution of a measured
random variable, the information thus provided is rather terse and
quite difficult for humans to interpret. Histograms provide a more
complete way of analyzing an experimental distribution. A
histogram has a big advantage over statistical moments: it can
easily be represented graphically. Figure \ref{fig:histogram}
shows a typical histogram.
\begin{figure}
\centering\includegraphics[width=12cm]{Figures/Histogram}
\caption{A typical histogram}\label{fig:histogram}
\end{figure}
A histogram is defined by three main parameters: $x_{\min}$, the
minimum of all values accumulated into the histogram, $w$, the bin
width and $n$, the number of bins. A bin is defined as an
interval. The $i\th$ bin of a histogram is the interval $\left[
x_{\min}+\left(i-1\right)w, x_{\min}+iw\right[$. The customary
convention is that the lower limit is included in the interval and
the higher limit excluded from the interval. The bin contents of a
histogram --- or histogram contents for short --- is the number of
times a value falls within each bin interval. Sometimes, a
histogram is defined by the minimum and maximum of the accumulated
values and the number of bins. The bin width is then computed as:
\begin{equation}
w = {x_{\max} - x_{\min} \over n},
\end{equation}
where $x_{\max}$ is the maximum of the accumulated values.
In section \ref{sec:mlfhist} we shall need the error on the
contents of a histogram. In absence of any systematic
effects\footnote{A good example of systematic effect is when
values are computed from measurements made with an ADC. In this
case, the integer rounding of the ADC may interfere with the bin
sorting of the histogram.} the contents of each bin are
distributed according to a Poisson distribution. The standard
deviation of a Poisson distribution is the square root of the
average. The standard deviation is used as an estimator of the
error on the bin contents\footnote{This is not a contradiction to
what was said in section \ref{sec:moments}: the bin content is not
an average, but a count}. If $n_i$ is the content of the $i\th$
bin of the histogram, the estimated error on the contents is
$\sqrt{n_i}$.
To obtain more information about the measured distribution, one
can also keep track of the number of values falling outside of the
histogram limits. The {\sl underflow} of a histogram is defined as
the number of values falling below the minimum of the accumulated
values. Similarly, the {\sl overflow} of a histogram is defined as
the number of values falling on\footnote{This is different from
the definition of the underflow to be consistent with the fact
that the definition of a bin interval is open ended at the upper
limit.} or above the maximum of the accumulated values.
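The binning and bookkeeping conventions above can be summarized in a small Python sketch (illustrative only; the class and method names here are hypothetical, the book's implementation is the Smalltalk class {\tt DhbHistogram}):

```python
import math

class Histogram:
    """Bin i covers [min + i*w, min + (i+1)*w): the lower limit is
    included, the upper limit excluded. Values below the range count
    as underflow; values at or above the upper limit as overflow."""

    def __init__(self, minimum, bin_width, bin_count):
        self.minimum = minimum
        self.bin_width = bin_width
        self.contents = [0] * bin_count
        self.underflow = 0
        self.overflow = 0

    def accumulate(self, x):
        i = math.floor((x - self.minimum) / self.bin_width)
        if i < 0:
            self.underflow += 1
        elif i >= len(self.contents):
            self.overflow += 1
        else:
            self.contents[i] += 1

    def errors(self):
        # Poisson estimate: the error on a bin content n_i is sqrt(n_i)
        return [math.sqrt(c) for c in self.contents]
```

A value exactly equal to the upper limit lands in the overflow, consistent with the open upper end of the bin intervals.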
\subsection{Histograms --- General implementation}
\marginpar{Figure \ref{fig:statisticsclasses} with the box {\bf
Histogram} grayed.} Our implementation of histograms also
accumulates the values into statistical moments. One can in
principle compute the statistical moments of the measured
distribution from the histogram contents. This determination,
however, depends strongly on the bin width, especially if the bin
width is large compared to the standard deviation. Thus, it is
preferable to use the original data when accumulating the
statistical moments. The net result is that a histogram has the
same polymorphic behavior as a statistical moment.
When defining a histogram, the histogram limits --- $x_{\min}$ and
$x_{\max}$ --- must be known in advance. This is not always
practical since it implies a first scan of the measurements to
determine the limits and a second scan to perform the accumulation
into the defined histogram. Thus, our implementation offers the
possibility of defining a histogram without predefined limits. In
this mode, the first values are cached into an array until a
sufficient number of data is available. When this happens, the
histogram limits are determined from the data and the cached
values are accumulated.
There are some cases where one would like to accumulate all the
values within the histogram limits. The proposed implementation
allows this by changing the histogram limits accordingly when a
new value falls outside of the current histogram limits. When a
histogram is accumulated in this mode the underflow and overflow
counts are always zero.
When the histogram limits are computed automatically, it can
happen that these limits have odd values. For example, if the
minimum value is 2.13456 and the maximum value is 5.1245,
selecting a number of bins of 50 would yield a bin width of
0.0597988. Of course such value for the bin width is quite
undesirable in practice. A similar thing can happen if the
application creating the histogram obtains the minimum and maximum
values from a computation or an automatic measuring device. To
avoid such silly parameters, our implementation computes a
reasonable limit and bin width by rounding the bin width to the
nearest reasonable scale at the order of magnitude\footnote{Let us
recall that the order of magnitude is the power of ten of a
number.} of the bin with. The possible scales are chosen to be
easily computed by a human. In our example, the order of magnitude
is $-2$. The bin width is then selected to be 0.075 and the
minimum and maximum are adjusted to be integral multiples of the
bin width enclosing the given limits. In our example, there are
$2.1$ and $5.175$ and the number of bins becomes 41 instead of 50.
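The rounding procedure can be sketched in Python as follows. This is illustrative only: the scale set $\{1, 2, 2.5, 5, 7.5, 10\}$ is an assumption chosen to reproduce the example above; the actual values are defined in the listing of {\tt DhbHistogram}.

```python
import math

def nice_bin_parameters(xmin, xmax, desired_bins,
                        scales=(1, 2, 2.5, 5, 7.5, 10)):
    """Round the raw bin width up to the nearest 'reasonable' scale at
    its order of magnitude, then snap the limits to integral multiples
    of the rounded width. The scale set is an assumption."""
    raw = (xmax - xmin) / desired_bins
    magnitude = math.floor(math.log10(raw))   # order of magnitude
    mantissa = raw / 10 ** magnitude          # in [1, 10)
    width = min(s for s in scales if s >= mantissa) * 10 ** magnitude
    low = math.floor(xmin / width) * width
    high = math.ceil(xmax / width) * width
    return low, high, round((high - low) / width)
```

On the example of the text, a raw width of 0.0597988 has order of magnitude $-2$ and mantissa 5.98, which rounds up to the scale 7.5, giving a width of 0.075 and limits 2.1 and 5.175 with 41 bins.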
\subsection{Histograms --- Smalltalk implementation}
\label{sec:shistogram} Listing \ref{ls:histogram} shows the
implementation of a histogram in Smalltalk. The following code
shows how to use the class {\tt DhbHistogram} to accumulate
measurements into a histogram.
\begin{codeExample}
\begin{verbatim}
| histogram valueStream |
histogram := DhbHistogram new.
[ valueStream atEnd ]
whileFalse: [ histogram accumulate: valueStream next ].
\end{verbatim}
\hfil {\tt<\sl printing or display of the histogram\tt >}\hfil
\end{codeExample}
This example assumes that the measurements of the random variable
are obtained from a stream. The exact implementation of the stream
is not shown here.
After the declarative statements, the first executable statement
creates a new instance of the class {\tt DhbHistogram} with the
default settings: automatic determination of the limits for 50
desired bins. The next two lines are the accumulation proper using
a {\tt whileFalse:} construct and the general behavior of a
stream. This code is very similar to the code example presented in
section \ref{sec:smoments}. Extracting the parameters of the
distribution can also be performed from the histogram.
\noindent The next example shows how to declare a histogram with
given limits (2 and 7) and a desired number of bins of 100:
\begin{codeExample}
\begin{verbatim}
| histogram valueStream |
histogram := DhbHistogram new.
histogram setRangeFrom: 2.0 to: 7.0 bins: 100.
\end{verbatim}
\hfil {\tt<\sl the rest is identical to the previous example\tt
>}\hfil
\end{codeExample}
\noindent The class {\tt DhbHistogram} has the following instance
variables:
\begin{description}
\item[\tt minimum] the minimum of the accumulated values, that
is $x_{\min}$,
\item[\tt binWidth] the bin width, that is $w$,
\item[\tt overflow] a counter to accumulate the overflow of the histogram,
\item[\tt underflow] a counter to accumulate the underflow of the histogram,
\item[\tt moments] an instance of the class {\tt
DhbFixedStatisticalMoments} to accumulate statistical moments up
to the $4\th$ order (\cf section \ref{sec:srobustmoment}) with
minimal rounding errors,
\item[\tt contents] the contents of the histogram, that is an
array of integers,
\item[\tt freeExtent] a Boolean flag denoting whether the limits
of the histogram can be adjusted to include all possible values,
\item[\tt cacheSize] the size of the cache allocated to collect
values for an automatic determination of the histogram limits,
\item[\tt desiredNumberOfBins] the number of bins desired by the
calling application.
\end{description}
Since there are many ways to declare a histogram, there is a
single creation method {\tt new}, which calls in turn a single
standard initialization method {\tt initialize}. In this mode, the
histogram is created with undefined limits --- that is, the first
accumulated values are cached until a sufficient number is
available for an automatic determination of the limits --- and a
default number of bins. The default number of bins is defined by
the class method {\tt defaultNumberOfBins}.
\noindent Four methods allow one to change the default initialization.
The method {\tt setRangeFrom:to:bins:} allows the definition of
the parameters $x_{\min}$, $x_{\max}$ and $n$, respectively. The
method {\tt setWidth:from:bins:} allows the definition of the
parameters $w$, $x_{\min}$ and $n$, respectively. In both cases,
the histogram limits and number of bins are adjusted to reasonable
values as explained at the end of section \ref{sec:histogram}.
These methods generate an error if the histogram is not cached, as
limits cannot be redefined while the histogram is accumulating.
The method {\tt setDesiredNumberOfBins:} allows one to override the
default number of bins. Finally, the method {\tt freeExtent:}
takes a Boolean argument to define whether or not the limits must
be adjusted when an accumulated value falls outside of the
histogram limits. This method generates an error if any count has
been accumulated in the underflow or overflow.
The method {\tt accumulate} is used to accumulate the values. If
the histogram is still cached --- that is when values are directly
accumulated into the cache for later determination of the limits
--- accumulation is delegated to the method {\tt
accumulateInCache:}. In this mode, the instance variable {\tt
contents} is an {\tt OrderedCollection} collecting the values.
When the size of the collection reaches the maximum size
allocated to the cache, the limits are computed and the cache is
flushed. In direct accumulation mode, the bin index corresponding
to the value is computed. If the index is within the range, the
value is accumulated. Otherwise it is treated like an overflow or
an underflow. The method {\tt processOverflows:for:} handles the
case where an accumulated value falls outside of the histogram
limits. If histogram limits cannot be adjusted it simply counts
the overflow or the underflow. Otherwise processing is delegated
to the methods {\tt growsContents:}, {\tt growsPositiveContents}
and {\tt growsNegativeContents}, which adjust the histogram limits
according to the accumulated value.
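The two accumulation modes described above can be summarized with a short sketch. The following Python code is a simplified stand-in for the Smalltalk class: the class name, the default cache size, and the absence of limit rounding and of growing limits are all assumptions made for illustration.

```python
class CachingHistogram:
    """Sketch of the caching behavior described above. It omits the
    rounding of limits to reasonable values and the adjustment of
    limits for out-of-range values (freeExtent:)."""

    def __init__(self, desired_bins=5, cache_size=100):
        self.cache = []              # plays the role of 'contents' while cached
        self.desired_bins = desired_bins
        self.cache_size = cache_size
        self.counts = None           # None means "still in cached mode"
        self.underflow = 0
        self.overflow = 0

    def accumulate(self, x):
        if self.counts is None:      # cached mode: just collect the value
            self.cache.append(x)
            if len(self.cache) >= self.cache_size:
                self._flush_cache()
            return
        if x < self.x_min:           # direct mode: bin, underflow or overflow
            self.underflow += 1
        elif x > self.x_max:
            self.overflow += 1
        else:
            index = min(int((x - self.x_min) / self.bin_width),
                        self.desired_bins - 1)
            self.counts[index] += 1

    def _flush_cache(self):
        # Determine the limits from the cached values, then re-accumulate.
        self.x_min, self.x_max = min(self.cache), max(self.cache)
        self.bin_width = (self.x_max - self.x_min) / self.desired_bins or 1.0
        self.counts = [0] * self.desired_bins
        cached, self.cache = self.cache, []
        for x in cached:
            self.accumulate(x)
```

Once the cache is flushed, values are binned directly and out-of-range values are counted as underflows or overflows, mirroring the division of labor between {\tt accumulate} and {\tt processOverflows:for:}.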
The adjustment of the histogram limits to reasonable values is
performed by the method {\tt adjustDimensionUpTo:}. This is done
when the limits are determined automatically. This method is also
used when the limits are specified using one of the initialization
methods.
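As an illustration of what ``reasonable values'' may mean, the following Python helper rounds a raw bin width up to 1, 2 or 5 times a power of ten. The rounding rule is an assumption chosen for illustration, not necessarily the exact rule of {\tt adjustDimensionUpTo:}.

```python
import math

def nice_width(raw_width):
    """Round a positive raw bin width up to 1, 2 or 5 times a power of ten.
    (Illustrative rule; the Smalltalk class may round differently.)"""
    exponent = math.floor(math.log10(raw_width))
    scaled = raw_width / 10 ** exponent      # scaled lies in [1, 10)
    for nice in (1, 2, 5, 10):
        if scaled <= nice:
            return nice * 10 ** exponent
```

For example, a raw width of 0.3 is rounded up to 0.5, and a raw width of 12 is rounded up to 20.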
There are many methods used to retrieve information from a
histogram. Enumerating them here would be too tedious. Method
names are explicit enough to get a rough idea of what each method
is doing; looking at the code should suffice for a detailed
understanding. The reader should just note that all methods
retrieving the parameters of the distribution, as discussed in
section \ref{sec:smoments}, are implemented by delegating the
method to the instance variable {\tt moments}.
The iterator method {\tt pointsAndErrorsDo:} is used for maximum
likelihood fit of a probability distribution. The Smalltalk
implementation of maximum likelihood fit is discussed in section
\ref{sec:smlfhist}.
\begin{listing} Smalltalk implementation of histograms \label{ls:histogram}
\input{Smalltalk/Statistics/DhbHistogram}
\end{listing}
\section{Random number generator}
\label{sec:random} When studying statistical processes on a
computer one often has to simulate the behavior of a random
variable\footnote{Another widespread use of random number generators
is in games.}. As we shall see in section \ref{sec:probdistr}, it
suffices to implement a random generator for a uniform
distribution, that is, a random variable whose probability density
function is constant over a given interval. Once such an
implementation is available, any probability distribution can be
simulated.
\rubrique{Linear congruential random generators} The most widely
used random number generators are linear congruential random
generators. Random numbers are obtained from the following series
\cite{Knuth2}:
\begin{equation}
\label{eq:crg}
X_{n+1} = \left(aX_n+c\right) \mod m,
\end{equation}
where $m$ is called the modulus, $a$ the multiplier and $c$ the
increment. By definition, we have $0\leq X_n<m$ for all $n$. The
numbers $X_n$ are actually pseudo-random numbers since, for a
given modulus, multiplier and increment, the sequence of numbers
$X_1,\ldots,X_n$ is fully determined by the value $X_0$. The value
$X_0$ is called the seed of the series. In spite of its
reproducibility, the generated series behaves much like that of a
random variable uniformly distributed between 0 and $m-1$. Then
the following variable:
\begin{equation}
x_n = {X_n \over m},
\end{equation}
is a random rational number uniformly distributed between 0 and 1,
1 excluded.
In practice, the modulus, multiplier and increment must be chosen
quite carefully to achieve a good randomness of the series. Don
Knuth \cite{Knuth2} gives a series of criteria for choosing the
parameters of the random number generator. If the parameters are
correctly chosen, the seed $X_0$ can be assigned to any value.
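The recurrence of equation \ref{eq:crg} takes only a few lines in any language. The following Python sketch uses Lehmer's widely known ``minimal standard'' parameters ($a=16807$, $c=0$, $m=2^{31}-1$); these are illustrative and not necessarily the parameters used in the Smalltalk implementation discussed later.

```python
class CongruentialGenerator:
    """Linear congruential generator: X(n+1) = (a*X(n) + c) mod m.
    Default parameters are Lehmer's minimal standard (an assumption
    for illustration, not the book's Smalltalk parameters)."""

    def __init__(self, seed=1, a=16807, c=0, m=2**31 - 1):
        self.a, self.c, self.m = a, c, m
        self.x = seed % m

    def value(self):
        # Advance the series by one step and return X(n+1).
        self.x = (self.a * self.x + self.c) % self.m
        return self.x

    def float_value(self):
        # Uniformly distributed rational number in [0, 1).
        return self.value() / self.m
```

With seed 1, the first two values of this particular generator are 16807 and 282475249; the same seed always reproduces the same series.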
\rubrique{Additive sequence generators} Another class of random
generators are additive sequence generators \cite{Knuth2}. The
series of pseudo-random numbers is generated as follows:
\begin{equation}
X_n = \left(X_{n-l}+X_{n-k}\right) \mod m,
\end{equation}
where $m$ is the modulus as before and $l$ and $k$ are two indices
such that $l<k$. These indices must be selected carefully;
\cite{Knuth2} contains a table of suitable indices. The initial
series of numbers $X_1,\ldots,X_k$ can be any integers not all
even.
Generators based on additive sequences are ideal to generate
floating point numbers. In this case, the modulo operation on the
modulus is not needed. Instead, one simply checks whether or not
the newly generated number is larger than 1. Thus, the series
becomes:
\begin{equation}
\label{eq:addseqrandom}
\begin{array}{lcl}
y_n&=&x_{n-l}+x_{n-k},\\*[2ex]
x_n&=&\left\{
\begin{array}{ll}
y_n&\mbox{\quad if $y_n<1$,}\\
y_n-1&\mbox{\quad if $y_n\geq 1$,}\\
\end{array}
\right.
\end{array}
\end{equation}
It is clear that the evaluation above is much faster than that of
equation \ref{eq:crg}. In practice, the additive sequence
generator is about 4 times faster. In addition, the length of the
sequence is larger than that of a congruential random generator
with the same modulus.
In our implementation we have selected the pair of numbers
$\left(24,55\right)$ corresponding to the generator initially
discovered by G.J. Mitchell and D.P. Moore \cite{Knuth2}. The
corresponding sequence has a length of $2^{55}-1$. In our tests
(\cf below) we have found that the randomness of this generator is
at least as good as that of the congruential random generator. The
initial series $x_1,\ldots,x_{55}$ is obtained from the
congruential random generator.
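A sketch of the Mitchell-Moore generator in Python follows equation \ref{eq:addseqrandom} with the lags $\left(24,55\right)$. As a shortcut, the initial series here is drawn from Python's own generator rather than from a congruential generator as in the Smalltalk implementation.

```python
import random

class MitchellMoore:
    """Additive sequence generator: x(n) = (x(n-24) + x(n-55)) mod 1,
    on floating-point numbers, so the modulo reduces to a subtraction."""

    LOW_LAG, HIGH_LAG = 24, 55

    def __init__(self, seed=None):
        rng = random.Random(seed)
        # Circular buffer holding the last 55 generated values
        # (seeded from Python's generator: an implementation shortcut).
        self.values = [rng.random() for _ in range(self.HIGH_LAG)]
        self.low = self.HIGH_LAG - self.LOW_LAG   # position of x(n-24)
        self.high = 0                             # position of x(n-55)

    def float_value(self):
        y = self.values[self.low] + self.values[self.high]
        if y >= 1.0:                              # cheaper than a true modulo
            y -= 1.0
        self.values[self.high] = y                # overwrite the oldest value
        self.low = (self.low + 1) % self.HIGH_LAG
        self.high = (self.high + 1) % self.HIGH_LAG
        return y
```

All generated values lie in $\left[0,1\right[$, and a given seed reproduces the same series, as required for reproducible simulations.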
In \cite{Knuth2} Don Knuth describes a wealth of tests to
investigate the randomness of random number generators. Some of
these tests are also discussed in \cite{LawKel}. To study the
goodness of our proposed random generators, we have performed two
types of tests: a $\chi^2$ test and a correlation test.
The $\chi^2$ test is performed on a histogram, in which values
generated according to a probability distribution have been
accumulated. Then, a $\chi^2$ confidence level (\cf section
\ref{sec:chitest}) is calculated against the theoretical curve
computed using the histogram bin width, the number of generated
values and the parameters of the distribution. A confidence level
larger than $60\%$ indicates that the probability
distribution is correctly generated. When using distributions
requiring the generation of several random numbers to obtain one
value --- normal distribution (2 values), gamma distribution (2
values) and beta distribution (4 values)
--- one can get a good confidence that short term
correlations\footnote{Pseudo random number generators have a
tendency to exhibit correlations in the series. That is, the
number $X_n$ can be correlated to the number $X_{n-k}$ for each
$n$ and a given $k$.} are not creating problems. The code for this
test is given in the code examples \ref{exs:chitest}. In this test the Mitchell-Moore
generator gives results similar to that of the congruential random
generator.
The correlation test is performed by computing the covariance
matrix (\cf section \ref{sec:covmatrix}) of vectors of given
dimension (between 5 and 10). The covariance matrix should be a
diagonal matrix with all diagonal elements equal to $1/12$, the
variance of a uniform distribution. Deviation from this
theoretical case should be small. Here longer correlations can be
investigated by varying the dimension of the generated vectors. In
this test too, the Mitchell-Moore generator gives results similar
to that of the congruential random generator.
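The correlation test can be sketched as follows; the dimension, sample size and generator (Python's own) are illustrative choices. The diagonal elements of the resulting matrix should be close to $1/12\approx 0.083$, the variance of a uniform distribution, and the off-diagonal elements close to 0.

```python
import random

def covariance_of_uniform_vectors(dimension=5, n_vectors=20000, seed=7):
    """Estimate the covariance matrix of uniform random vectors.
    (Sketch of the correlation test; parameters are arbitrary.)"""
    rng = random.Random(seed)
    sums = [0.0] * dimension
    products = [[0.0] * dimension for _ in range(dimension)]
    for _ in range(n_vectors):
        v = [rng.random() for _ in range(dimension)]
        for i in range(dimension):
            sums[i] += v[i]
            for j in range(dimension):
                products[i][j] += v[i] * v[j]
    means = [s / n_vectors for s in sums]
    # cov(i,j) = E[v_i v_j] - E[v_i] E[v_j]
    return [[products[i][j] / n_vectors - means[i] * means[j]
             for j in range(dimension)] for i in range(dimension)]
```

Longer-range correlations can be probed by increasing the dimension of the generated vectors.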
\rubrique{Bit-pattern generators} The generators described above
are suitable for generating random values, but not for
generating random bit patterns \cite{Knuth2}, \cite{Press}. The
generation of random bit patterns can be achieved with generator
polynomials. Such polynomials are used in error correction codes
for their ability to produce sequences of numbers with a maximum
number of different bits. For example, the following polynomial
\begin{equation}
\label{eq:ranpolgen}
G\left(x\right) = x^{16} + x^{12} +x^5 + 1,
\end{equation}
is a good generator for random patterns of 16 bits\footnote{\cf O.
Yamada, K. Yamazaki and D.H. Besset, {\em An Error-Correction
Scheme for an Optical Memory Card System}, 1990 International
Symposium on Information Theory and its Applications (ISITA'90),
Hawaii, USA, November 27-30, 1990.}. Of course, the evaluation of
equation \ref{eq:ranpolgen} does not require the computation of
the polynomial. The following algorithm can be used: {\parskip 0pt
\begin{enumerate}
\item Set $X_{n+1}$ to $X_n$ shifted by one position to the left
and truncated to 16 bits ($X_{n+1}=2X_n \mod 2^{16}$),
\item If bit 15 (least significant bit being 0) of $X_n$ is set,
set $X_{n+1}$ to the bit wise exclusive OR of $X_{n+1}$ with
{\tt 0x1021}.
\end{enumerate}}
\noindent Other polynomials are given in \cite{Press}.
Random bit patterns are usually used to simulate hardware
behavior. They are rarely used in statistical analysis. A concrete
implementation of a random bit pattern generator is left as an
exercise to the reader.
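As a starting point for this exercise, a single step of the 16-bit algorithm above may be sketched in Python; the function name is an arbitrary choice.

```python
def next_pattern(x):
    """One step of the 16-bit pattern generator based on the polynomial
    G(x) = x^16 + x^12 + x^5 + 1, whose low-order terms give the mask 0x1021."""
    shifted = (x << 1) & 0xFFFF    # X(n+1) = 2 X(n) mod 2^16
    if x & 0x8000:                 # bit 15 of X(n) was set
        shifted ^= 0x1021          # fold it back in with the polynomial mask
    return shifted
```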
\subsection{Random number generator --- Smalltalk implementation}
\marginpar{Figure \ref{fig:statisticsclasses} with the boxes {\bf
CongruentialRandomNumberGenerator} and {\bf
MitchellMooreGenerator} grayed.} Listing \ref{ls:randomcong} shows
the implementation of a congruential random generator in
Smalltalk. Listing \ref{ls:randomseq} shows the implementation of
an additive sequence random generator in Smalltalk. Listing
\ref{ls:randomuse} shows convenience methods covering the most common uses.
\noindent The class {\tt DhbCongruentialRandomNumberGenerator} has
three public methods:
\begin{description}
\item[\tt value] returns the next random number of the series,
that is $X_n$, a number between $0$ and $m-1$,
\item[\tt floatValue] returns the next random floating-point number,
that is the value $X_n / m$,
\item[\tt integerValue:] returns a random integer between 0 and
one less than the specified argument.
\end{description}
When calling any of the above methods, the next random number of
the series is obtained using equation \ref{eq:crg}.
There are several ways of using a random number generator. If
there is no specific requirement, the easiest approach is to use
the instance provided by the default creation method ({\tt new}),
which returns a \patstyle{singleton}. The next example shows how to
proceed assuming the application uses the values generated by the
random generator directly:
\begin{codeExample}
\label{codex:crgdefault}
\begin{verbatim}
| generator x |
generator := DhbCongruentialRandomNumberGenerator new.
\end{verbatim}
\hfil{\tt <\sl Here is where the generator is used\tt
>}\hfil
\begin{verbatim}
x := generator value.
\end{verbatim}
\end{codeExample}
If one needs several series which must be separately reproducible,
one can assign several generators, one for each series. The
application can use predefined numbers for the seed of each
series. Here is a complete example assuming that the generated
numbers must be floating-point numbers.
\begin{codeExample}
\label{codex:crgseeded}
\begin{verbatim}
| generators seeds index x |
\end{verbatim}
{\tt seeds := <\sl an array containing the desired seeds\tt >}
\begin{verbatim}
generators := seeds collect:
[ :each | DhbCongruentialRandomNumberGenerator seed: each].
\end{verbatim}
\noindent\hskip 5ex{\tt <\sl Here is where the various generators
are used\tt
>}\hfil \\
\hskip 5ex{\tt <index \sl is the index of the desired series\tt
>}\hfil
\begin{verbatim}
x := ( generators at: index) floatValue.
\end{verbatim}
\end{codeExample}
In game applications, it is of course not desirable to have a
reproducible series. In this case, the easiest way is to use the
time in milliseconds as the seed of the series. This initial value
is sufficiently hard to reproduce to give the illusion of
randomness. Furthermore, the randomness of the series guarantees
that two series generated at almost the same time are quite
different. Here is how to do it.
\begin{codeExample}
\begin{verbatim}
| generator x |
generator := DhbCongruentialRandomNumberGenerator
seed: Time millisecondClockValue.
\end{verbatim}
\hfil{\tt <\sl Here is where the generator is used\tt
>}\hfil
\begin{verbatim}
x := (generator integerValue: 20) + 1.
\end{verbatim}
\end{codeExample}
In this last example, the generated numbers are integers between 1
and 20.
\rubrique{Implementation} The class {\tt
DhbCongruentialRandomNumberGenerator} has the following instance
variables:
\begin{description}
\item[\tt constant] the increment $c$,
\item[\tt modulus] the modulus $m$,
\item[\tt multiplicator] the multiplier $a$ and
\item[\tt seed] the last generated number $X_{n-1}$.
\end{description}
There are three instance creation methods. The method {\tt new}
returns a \patstyle{singleton} instance containing parameters from
\cite{Knuth2}. The method {\tt seed:} allows one to create an
instance to generate a series of random numbers starting at a given
seed. Finally the method {\tt constant:multiplicator:modulus:}
creates a congruential random number generator based on supplied
parameters. Readers tempted to use this method are strongly
advised to read \cite{Knuth2} and the references therein
thoroughly. Then, they should perform tests to verify that their
parameters are indeed producing a series with acceptable
randomness.
The modulus of the standard parameters has 32 bits. In the
Smalltalk implementation, however, the evaluation of equation
\ref{eq:crg} generates integers larger than 32 bits. As a result,
the generation of the random numbers is somewhat slow, as it is
using multiple precision integers. Using floating-point
numbers\footnote{The author is grateful to Dave N. Smith of IBM for
this useful tip.} does not disturb the evaluation of equation
\ref{eq:crg} and is significantly faster since floating point
evaluation is performed on the hardware. The generation of random
numbers using floating point parameters is about 3 times faster
than with integer parameters. This can easily be verified by the
reader.
\begin{listing} Smalltalk implementation of congruential random number generators
\label{ls:randomcong}
\input{Smalltalk/Statistics/DhbCongruentialRandomNumberGenerator}
\end{listing}
\noindent The class {\tt DhbMitchellMooreGenerator} implements a
random number generator with additive sequence. It has two public
methods:
\begin{description}
\item[\tt floatValue] returns the next random floating-point number,
that is the value $x_n$,
\item[\tt integerValue:] returns a random integer between 0 and
one less than the specified argument.
\end{description}
When calling any of the above methods, the next random number of
the series is obtained using equation \ref{eq:addseqrandom}. The
generated numbers are all floating-point values.
The creation methods {\tt new} and {\tt seed:} are used exactly as
the corresponding methods of the class {\tt
DhbCongruentialRandomNumberGenerator}. Please refer to the code
examples \ref{codex:crgdefault} and \ref{codex:crgseeded}. Both
methods use the congruential random number generator to generate
the initial series of numbers $x_1,\ldots,x_{55}$. The class
method {\tt constants:lowIndex:} offers a way to define the
numbers $k$ and $l$ as well as the initial series of numbers. The
reader wishing to use this method should consult the table of {\sl
good} indices $k$ and $l$ in \cite{Knuth2}.
\begin{listing} Smalltalk implementation of an additive sequence random number generator
\label{ls:randomseq}
\input{Smalltalk/Statistics/DhbMitchellMooreGenerator}
\end{listing}
For simple simulations, one often wishes to generate a random number ---
floating point or integer --- within certain limits. Here are
convenience methods implemented for the classes {\tt Number} and
{\tt Integer}. These methods free the user from keeping track of
the instance of the random number generator. For example, the
following Smalltalk expression
\begin{quote}
\begin{verbatim}
50 random
\end{verbatim}
\end{quote}
\noindent generates an integer random number between 0 and 49
included. Similarly the following Smalltalk expression
\begin{quote}
\begin{verbatim}
2.45 random
\end{verbatim}
\end{quote}
\noindent generates a floating-point random number between 0 and $2.45$
excluded. Finally the following Smalltalk expression
\begin{quote}
\begin{verbatim}
Number random
\end{verbatim}
\end{quote}
\noindent generates a floating-point random number between 0 and 1
excluded.
\begin{listing} Smalltalk implementation of random number generators
\label{ls:randomuse}
\input{Smalltalk/Statistics/Integer(DhbStatistics)}
\input{Smalltalk/Statistics/Number(DhbStatistics)}
\end{listing}
\section{Probability distributions}
\label{sec:probdistr}A probability density function defines the
probability of finding a continuous random variable within an
infinitesimal interval. More precisely, the probability density
function $P\left(x\right)$ gives the probability for a random
variable to take a value lying in the interval
$\left[x,x+dx\right[$. A probability density function is a special
case of a one variable function described in section
\ref{sec:function}.
The moment of $k^{\mathop{\rm th}}$ order for a probability
density function $P\left(x\right)$ is defined by:
\begin{equation}
\label{eq:probdensity}