-
Notifications
You must be signed in to change notification settings - Fork 20
/
Copy path3-probability_practice.tex
957 lines (894 loc) · 31.6 KB
/
3-probability_practice.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
\chapter{Probability in Practice: \\Probabilistic Computation}
The immediate problem with the abstract definitions introduced in the
previous chapter is that they do not provide an explicit means of either
specifying a probability distribution, at least without considering every
single event, or computing expectations in practical applications. When
the sample space is structured, however, that structure can be leveraged
to provide the explicit implementations we need to apply probability
theory in practice. This is particularly evident when the sample space
ordered, discrete, or a subset of real numbers.
\section{Cumulative Distribution Functions}
When the sample space is ordered we can completely specify a probability
distribution with only the probabilities assigned to \emph{intervals},
$\mathcal{I} \! \left( \Theta \right) \subset \mathcal{E} \! \left( \Theta \right)$.
Intervals are events spanning all points less than or equal to some
distinguished point, $\theta$,
%
\begin{equation*}
I \! \left( \theta \right) = \left\{ \theta' \in \Theta \mid \theta' \leq \theta \right\}.
\end{equation*}
%
The function that assigns these probabilities,
%
\begin{align*}
P
&: \Theta \rightarrow \mathcal{I} \!
\left( \Theta \right) \rightarrow \left[0, 1 \right]
\\
&\quad \theta \mapsto
I \! \left( \theta \right) \;\; \mapsto
\PP \! \left[ I \! \left( \theta \right) \right].
\end{align*}
%
is called the \emph{cumulative distribution function}.
Cumulative distribution functions naturally pushforward from one
sample space into another. Given a measureable map
$s : \Theta \rightarrow \Omega$, the cumulative distribution function
on $\Omega$ is defined by
%
\begin{equation*}
P_{\Omega} \! \left( I_{\omega} \right)
\equiv
P_{\Theta} \! \left( s^{-1} \! \left( I_{\omega} \right) \right).
\end{equation*}
\section{Implementations of Probability Distributions \\
over Discrete Sample Spaces}
Every discrete sample space features a natural measure which we
can use to not only specify all other measures, and hence probability
distributions, but also convert expectations into summations.
\subsection{The Counting Measure and Mass Functions}
The \emph{counting measure} on a discrete space assigns unit
measure to the \emph{point events},
%
\begin{equation*}
\chi \! \left( \theta \right) = 1, \, \forall \theta \in \Theta.
\end{equation*}
%
In order to compute the measure of any other event we
simply count the elements in that event,
%
\begin{equation*}
\chi \! \left( E \right)
=
\sum_{\theta \in E} 1 = \left| E \right|,
\end{equation*}
%
hence the name. Similarly, Lebesgue integrals reduce to a
summation over the sample space weighted by the function
values,
%
\begin{equation*}
\mathcal{I}_{\chi} \! \left[ f \right]
=
\sum_{\theta \in \Theta}
f \! \left( \theta \right) \chi \! \left( \theta \right)
=
\sum_{\theta \in \Theta} f \! \left( \theta \right),
\end{equation*}
%
for any $f \in \mathcal{F} \! \left( \Theta \right)$.
Because the counting measure assigns zero measure only to the
null event, all discrete measures are absolutely continuous with
respect to it. Consequently, any discrete measure can be defined
by its Radon-Nikodym derivative with respect to the counting measure.
In this case the Radon-Nikodym derivative is denoted a
\emph{mass function}
%
\begin{equation*}
m : \Theta \rightarrow \RR^{+}.
\end{equation*}
%
Measures and Lebesgue integrals then reduce to discrete summations
weighted by the mass function,
%
\begin{align*}
\mu \! \left( E \right)
&=
\sum_{\theta \in E} m \! \left( \theta \right)
\\
\mathcal{I}_{\mu} \! \left[ f \right]
&=
\sum_{\theta \in \Theta} f \! \left( \theta \right) m \! \left( \theta \right).
\end{align*}
Moreover, the Radon-Nikodym derivative between different discrete
measures reduces to the simple ratio of their mass functions,
%
\begin{equation*}
\frac{ \dd \mu_{2} }{ \dd \mu_{1} } \! \left( \theta \right)
=
\frac{ m_{2} \! \left( \theta \right) }{ m_{1} \! \left( \theta \right) }.
\end{equation*}
\subsection{Transforming Counting Measures and Mass Functions}
The counting measure, and hence mass functions, also have the
convenient property that they transform naturally with changes to
the underlying sample space. Given a measurable map
$s : \Theta \rightarrow \Omega$, a mass function on $\Theta$
pushes forward to a mass function on $\Omega$,
%
\begin{equation*}
m_{\Omega} \! \left( \omega \right)
\equiv
m_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right).
\end{equation*}
\subsection{Probability Mass Functions}
Mass functions corresponding to probability distributions reduce
to \emph{probability mass functions}, which assigns probability to the
point events,
%
\begin{equation*}
p : \Theta \rightarrow \left[0, 1\right].
\end{equation*}
%
such that
%
\begin{equation*}
1 = \pi \! \left[ \Theta \right] = \sum_{\theta \in \Theta} p \! \left( \theta \right).
\end{equation*}
\subsection{Conditional Probability Mass Functions}
Probability mass functions immediately generalize to specify conditional
probability distributions by simply adding a conditioning variable,
%
\begin{equation*}
p_{\Theta \mid \Phi} : \Theta \times \Phi \rightarrow \left[0, 1\right].
\end{equation*}
%
Conditional probabilities and conditional expectations are computed
as above,
%
\begin{align*}
\PP_{\Theta \mid \Phi} \! \left[ E \mid \phi \right]
&=
\sum_{\theta \in E} p_{\Theta \mid \Phi} \! \left( \theta \mid \phi \right),
\\
\mathbb{E}_{\PP_{\Theta \mid \Phi} } \! \left[ f \mid \phi \right]
&=
\sum_{\theta \in \Theta} f \! \left( \theta \right)
p_{\Theta \mid \Phi} \! \left( \theta \mid \phi \right),
\end{align*}
A huge advantage of this representation is that it drastically simplifies the
construction of joint and marginal probability distributions. Instead of
implicitly defining an abstract joint distribution with probability assignments
or expectations, for example, we can construct an explicit joint probability
mass function as the product of a conditional mass function and a marginal
mass function,
%
\begin{equation*}
p_{\Theta \times \Phi} \! \left( \theta, \phi \right) =
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right) p_{\Phi} \! \left( \phi \right),
\end{equation*}
%
which readily gives joint probabilities and joint expectations.
Marginalization proceeds similarly -- the marginal probability mass function
is given by simply summing the joint probability mass function over the
nuisance components,
%
\begin{align*}
p_{\Theta} \! \left( \theta \right)
&=
\sum_{\phi \in \Phi} p_{\Theta \times \Phi} \! \left( \theta, \phi \right) \\
&=
\sum_{\phi \in \Phi}
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right) p_{\Phi} \! \left( \phi \right).
\end{align*}
Bayes' Theorem implies that any decomposition of a joint probability
mass function is equal,
%
\begin{equation} \label{eqn:discrete_bayes}
p_{\Phi \mid \Theta} \! \left( \phi | \theta \right) p_{\Theta} \! \left( \theta \right)
=
p_{\Theta \times \Phi} \! \left( \theta, \phi \right)
=
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right) p_{\Phi} \! \left( \phi \right),
\end{equation}
%
and from this one equality we can then recover the various implications of
Bayes' Theorem. For example,
%
\begin{equation*}
\pi_{\Theta \mid \Phi} =
\frac{ \dd \pi_{\Phi \mid \Theta} }{ \dd \pi_{\Phi} }
\pi_{\Theta}
\end{equation*}
%
becomes
%
\begin{equation*}
p_{\Theta \mid \Phi} \! \left( \theta \mid \phi \right) =
\frac{ p_{\Phi \mid \Theta} \! \left( \phi \mid \theta \right) }
{ p_{\Phi} \! \left( \phi \right) }
p_{\Theta} \! \left( \theta \right),
\end{equation*}
%
which is an immediate consequence of dividing both sides of the joint
equality \eqref{eqn:discrete_bayes} by $p_{\Phi} \! \left( \phi \right)$.
\subsection{Relating Probability Mass Functions and Cumulative
Distribution Functions}
Because probability mass functions and cumulative distribution functions
both specify the same probability distribution, one can always be used to
construct the other. Given a probability mass function, for example, we
can construct the cumulative distribution function as
%
\begin{equation*}
P \! \left( \theta \right)
= \PP \! \left[ I \! \left( \theta \right) \right]
= \sum_{\theta' \in I ( \theta )} p \! \left( \theta' \right).
\end{equation*}
%
Similarly, we can construct a probability mass function from a cumulative
distribution function as
%
\begin{equation*}
p \! \left( \theta \right) =
P \! \left( \theta \right)
- P \! \left( \theta_{-} \right),
\end{equation*}
%
where $\theta_{-}$ is the largest element of $\Theta$ less that $\theta$,
%
\begin{equation*}
\theta_{-} = \max \left\{ \theta' \in \Theta \mid \theta' < \theta \right\}.
\end{equation*}
\section{Implementations of Probability Distributions over Real
Numbers}
When the sample space are $D$-dimensional real numbers, or
a subset thereof, there are always an uncountably infinite number
of point events and we can't assign a finite measure to each without
the measure of any dense event becoming infinite. Consequently,
the counting measure becomes pathological, assigning either infinite
measure or zero measure to every dense event. Fortunately, the real
numbers admit their own natural measure on which we can implement
probability distributions in practice.
\subsection{The Lebesgue Measure and Density Functions}
\textbf{Lebesgue measure on R assigns measure based on interval
length, then generalize as a product of Lebesgue measures that
assigns measure based on volume.}
The Lebesgue measure on the one-dimensional real numbers
assigns to each interval,
%
\begin{equation*}
I = \left[a, b \right] \subset \RR,
\end{equation*}
%
a measure proportional to its length,
%
\begin{equation*}
\lambda \! \left( I \right) = b - a.
\end{equation*}
This generalizes to the general Lebesgue measure, $\lambda$,
which assigns to each rectangular neighborhood
%
\begin{equation*}
I_{D} = \prod_{d=1}^{D} I_{d} = \prod_{d=1}^{D} \left[a_{d}, b_{d} \right]
\subset \RR^{D},
\end{equation*}
%
a measure proportional to its volume,
%
\begin{equation*}
\lambda \! \left( I_{d} \right) =
\prod_{d = 1}^{D} \left(b_{d} - a_{d} \right).
\end{equation*}
%
Although not immediately obvious, this condition implies that the
Lebesgue measure of an arbitrary event is also given by the
enclosed volume specified by the Riemann integral over that event
%
\begin{equation*}
\lambda \! \left( E \right)= \int_{E} \dd^{D} \theta.
\end{equation*}
%
Because of this equivalence, it is common shorthand to write the
Lebesgue measure on $\Theta = R^{D}$ as
%
\begin{equation*}
\lambda = \dd^{D} \theta = \prod_{d = 1}^{D} \dd \theta_{d}.
\end{equation*}
Similarly, Lebesgue integrals with respect to the Lebesgue measure
(mathematicians have sick sense of humor) also reduce to Riemann
integrals,
%
\begin{equation*}
\mathcal{I}_{\lambda} \! \left( f \right)
= \int_{\Theta} f \! \left( \theta \right) \dd \theta.
\end{equation*}
%
An immediate consequence of this result is that all point events
have zero Lebesgue measure.
Provided that a measure does not assign non-zero measure to any
point events, it will be absolutely continuous with respect to the
Lebesgue measure and hence can be completely specified with
the corresponding Radon-Nikodym derivative. In this case the
Radon-Nikodym derivative is denoted a \emph{density function},
%
\begin{equation*}
m : \Theta \rightarrow \RR^{+}.
\end{equation*}
%
As above, any measure or Lebesgue integral becomes a standard
Riemann integral that can be readily computed with the usual
calculus techniques,
%
\begin{align*}
\mu \! \left( E \right)
&=
\mathcal{I}_{\lambda} \! \left[ m \cdot I_{E} \right]
=
\int_{E} m \! \left( \theta \right) \dd \theta
\\
\mathcal{I}_{\mu} \! \left[ f \right]
&=
\mathcal{I}_{\lambda} \! \left[ f \cdot m \right]
=
\int_{\Theta} f \! \left( \theta \right) m \! \left( \theta \right) \dd \theta.
\end{align*}
The Radon-Nikodym derivative between two different real measures
behaves exactly as in the discrete case. If $\mu_{2}$ is
absolutely continuous with respect to $\mu_{1}$ then the density
$m_{2} \! \left( \theta \right)$ can vanish only when the density
$m_{1} \! \left( \theta \right)$ vanishes, and the corresponding
Radon-Nikodym derivative is given by
%
\begin{equation*}
\frac{ \dd \mu_{2} }{ \dd \mu_{1} } \! \left( \theta \right)
=
\frac{ m_{2} \! \left( \theta \right) }{ m_{1} \! \left( \theta \right) }.
\end{equation*}
%
When both densities vanish we have to be careful take an
appropriate limit in order to evaluate the Radon-Nikodym derivative.
\subsection{Transforming Lebesgue Measures and Density Functions}
One of the most subtle properties of Lebesgue measures, and
hence density functions, is that they do not transform as naturally
as their discrete counterparts. The difficulty is that nonlinear
transformations warp the real numbers, distorting volumes and,
consequently, the Lebesgue measure.
Consider a nonlinear, measurable map $s : \Theta \rightarrow \Omega$,
and let $\tilde{\lambda}_{\Omega}$ be the corresponding pushforward
of the Lebesgue measure on $\Theta$ onto $\Omega$,
%
\begin{equation*}
\tilde{\lambda}_{\Omega} \! \left( E_{\Omega} \right)
\equiv
s_{*} \lambda_{\Theta} \! \left( E_{\Omega} \right)
=
\mu^{L}_{\Theta} \! \left[ s^{-1} \! \left( E_{\Omega} \right) \right].
\end{equation*}
%
The pushforward $\tilde{\lambda}_{\Omega}$ is not the Lebesgue
measure on $\Omega$, rather the two are related by the Radon-Nikodym
derivative
%
\begin{equation*}
\frac{ \dd \lambda_{\Omega} }
{\dd \tilde{\lambda}_{\Omega} }
=
\left| \mathbf{J} \right|,
\end{equation*}
%
where the matrix
%
\begin{equation*}
J_{ij}
=
\frac{\partial \omega_{i} }{ \partial \theta_{j} }
\equiv
\frac{ \partial s_{i} }{ \partial \theta_{j} }
\end{equation*}
%
is called the \emph{Jacobian} of the transformation. The determinant
of the Jacobian matrix quantifies how the nonlinear transformation
warps volumes and hence the Lebesgue measure.
This nontrivial relationship between Lebesgue measures and their
pushforwards has an important consequence for density functions.
Density functions do not transform as we might naively expect,
%
\begin{equation*}
m_{\Omega} \! \left( \omega \right)
\ne
m_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right),
\end{equation*}
%
rather they too are modified by the determinant of the Jacobian matrix,
%
\begin{equation*}
m_{\Omega} \! \left( \omega \right)
=
m_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right) | \mathbf{J} |^{-1}.
\end{equation*}
%
We can also motivate this relationship by considering the invariance
of Lebesgue integrals. Equivalent sample spaces feature distinct
Lebesgue measure and density functions, but they must have the
\emph{same Lebesgue integrals} (Figure \ref{fig:pdf_transform}).
The Lebesgue integrals are invariant only if the Lebesgue measures
and densities transform oppositely to each other.
\begin{figure*}
\centering
%
\subfigure[]{
%\begin{tikzpicture}[scale=0.3, thick, show background rectangle]
\begin{tikzpicture}[scale=0.3, thick]
\node[] at (0,0) {\includegraphics[width=5cm]{density.eps}};
\foreach \i in {-10.5, -10, ..., 10.5} {
\draw[-, color=gray80] (-11, \i) to (11, \i);
\draw[-, color=gray80] (\i, -11) to (\i, 11);
}
\node[] at (0,0) {\includegraphics[width=5cm]{area.eps}};
\node[] at (0, -13) { $\mu \! \left( E_{\Theta} \right) = \int_{E_{\Theta}}
m_{\Theta} (\theta_{1}, \theta_{2})
\dd \theta_{1} \dd \theta_{2} $ };
\end{tikzpicture}
}
%
\subfigure[]{
%\begin{tikzpicture}[scale=0.3, thick, show background rectangle]
\begin{tikzpicture}[scale=0.3, thick]
\node[] at (0,0) {\includegraphics[width=5cm]{transformed_density.eps}};
\foreach \i in {-10.9969, -10.9428, -10.8857, -10.8251, -10.7605, -10.6915, -10.6174,
-10.5374, -10.4504, -10.355, -10.2497, -10.1319, -9.99835, -9.84419,
-9.66186, -9.4387, -9.15099, -8.74547, -8.05217, 0., 8.05217,
8.74547, 9.15099, 9.4387, 9.66186, 9.84419, 9.99835, 10.1319,
10.2497, 10.355, 10.4504, 10.5374, 10.6174, 10.6915, 10.7605,
10.8251, 10.8857, 10.9428, 10.9969} {
\draw[-, color=gray80] (-11, \i) to (11, \i);
\draw[-, color=gray80] (\i, -11) to (\i, 11);
}
\node[] at (0,0) {\includegraphics[width=5cm]{transformed_area.eps}};
\node[] at (0, -13) { $\mu \! \left( s \! \left( E_{\Theta} \right) \right)
= \int_{s(E_{\Theta})}
m_{\Omega} (\omega_{1}, \omega_{2})
\dd \omega_{1} \dd \omega_{2}$ };
\end{tikzpicture}
}
\caption{When sample spaces are real, each sample space has
its own density functions (red), events (green), and differential volumes
(grey). Here the sample space in (b) is related to the sample space in
(a) by the compatible mapping $\left(\omega_{1}, \omega_{2} \right)
= s \!\left( \theta_{1}, \theta_{2} \right)$. All of these differences, however,
exactly compensate to ensure that integrals always yield the same
measures, $\mu \! \left( E_{\Theta} \right) =
\mu \! \left( s \! \left( E_{\Theta} \right) \right) $.
}
\label{fig:pdf_transform}
\end{figure*}
Consequently, density functions depend not just on the abstract
system that we're trying to describe but also on the particular
representation that we're using. The only application of density
functions that are not sensitive to this dependence is integration,
and we have to be careful not to take densities too seriously when
they are not explicitly under an integral sign.
\subsubsection{Univariate Examples of Density Transformations}
A common motivation for transforming the sample space, and
hence any probability density functions, is when the nominal sample
space is contained. Mapping to an unconstrained sample space
removes the unconstraint, potentially simplifying calculations.
For example, let $p_{\Theta} \! \left( \theta \right)$ be a probability
distribution function defined on $\Theta = \RR^{+}$. We can
unconstrained the positivity constraint by applying a logarithmic
transformation,
%
\begin{equation*}
\omega = s \! \left( \theta \right) = \log \theta \in \RR.
\end{equation*}
%
which results in the Jacobian
%
\begin{equation*}
J
= \frac{ \partial s }{ \partial \theta} \! \left( \omega \right)
= \frac{1}{\exp \! \left( \omega \right)}
\end{equation*}
%
and the transformed probability density function on $\Omega = \RR$,
%
\begin{align*}
p_{\Omega} \! \left( \omega \right)
&=
p_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right) \left|J\right|^{-1}
\\
&=
p_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right)
\exp \! \left( \omega \right).
\end{align*}
Similarly, we can unconstrain a sample space representing probabilities
themselves, $\Theta = \left[0, 1 \right]$ by applying a logistic transform,
%
\begin{equation*}
\omega = s \! \left( \theta \right) = \frac{1}{1 + \exp( - \theta ) } \in \RR.
\end{equation*}
%
This results in the Jacobian
%
\begin{equation*}
J
= \frac{ \partial s }{ \partial \theta} \! \left( \omega \right)
= \frac{1}{\omega \! \left( 1 - \omega \right) }
\end{equation*}
%
and the transformed probability density function on $\Omega = \RR$,
%
\begin{align*}
p_{\Omega} \! \left( \omega \right)
&=
p_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right) \left|J\right|^{-1}
\\
&=
p_{\Theta} \! \left( s^{-1} \! \left( \omega \right) \right)
\omega \! \left( 1 - \omega \right).
\end{align*}
\subsubsection{A Multivariate Example of Density Transformations}
To see how much a probability density function can be warped, consider
a two-dimensional sample space with real components
$\left( \theta_{1}, \theta_{2} \right)$ and the probability density function
%
\begin{equation*}
p_{\Theta} \! \left( \theta_{1}, \theta_{2} \right)
\propto
\exp \! \left( - \frac{\theta_{1}^{2} + \theta_{2}^{2}}{2} \right).
\end{equation*}
%
and the measurable map
%
\begin{equation*}
\left( \omega_{1}, \omega_{2} \right)
=
r \! \left( \theta_{1}, \theta_{2} \right)
=
\left( s \! \left( \theta_{1} \right), s \! \left( \theta_{2} \right) \right),
\end{equation*}
%
where the component maps are given by
%
\begin{equation*}
s \! \left( \theta \right)
=
\log \! \left(
\frac{ \pi + 2 \, \mathrm{atan} \! \left( \alpha \, \theta \right) }
{ \pi - 2 \, \mathrm{atan} \! \left( \alpha \, \theta \right) }
\right)
\end{equation*}
%
with the inverse
%
\begin{equation*}
s^{-1} \! \left( \omega \right)
=
\frac{1}{\alpha}
\tan \! \left(
\frac{\pi}{2} \, \frac{e^{\omega} - 1}{e^{\omega} + 1} \right)
\end{equation*}
%
and Jacobian
%
\begin{equation*}
J \! \left( \omega \right)
=
\frac{ \partial s }{ \partial \theta } \! \left( \omega \right)
=
\frac{\alpha}{\pi} \frac{ \left( 1 + e^{\omega} \right)^{2} }{ e^{\omega} }
\sin^{2} \! \left( \frac{ \pi }{ 1 + e^{\omega} } \right).
\end{equation*}
%
The Jacobian of the complete map is then given by
%
\begin{equation*}
\mathbf{J} =
\begin{bmatrix}
J & 0 \\
0 & J
\end{bmatrix},
\end{equation*}
%
with the determinant $| \mathbf{J} | = J^{2}$. Hence the transformed
probability density function and differential volume are given by
%
\begin{equation*}
p_{\Omega} \! \left( \omega_{1}, \omega_{2} \right)
=
p_{\Theta} \! \left(
s^{-1} \! \left( \omega_{1} \right),
s^{-1} \! \left( \omega_{2} \right) \right)
J^{-2} \! \left( \omega \right)
\end{equation*}
%
and
%
\begin{equation*}
\dd \omega_{1} \dd \omega_{2}
=
J^{2} \dd \theta_{1} \dd \theta_{2},
\end{equation*}
%
respectively. The extreme differences between these two representations
are evident graphically in Figure \ref{fig:pdf_transform}.
\subsection{Probability Density Functions}
Density functions corresponding to probability distributions reduce
to \emph{probability density function} which assign a positive value
to each point in the sample space,
%
\begin{equation*}
p : \Theta \rightarrow \mathbb{R}^{+}.
\end{equation*}
such that
%
\begin{equation*}
1 = \pi \! \left( \Theta \right) = \int_{\Theta} p \! \left( \theta \right) \dd \theta.
\end{equation*}
%
These values, known as \emph{probability densities}, have no particular
meaning of their own and instead exist only to be integrated to give
probabilities,
%
\begin{equation*}
\pi \! \left( E \right)
=
\int_{E} p \! \left( \theta \right) \dd \theta,
\end{equation*}
%
and expectations,
%
\begin{equation*}
\mathbb{E}_{\pi} \! \left[ f \right]
=
\int_{\Theta} f \! \left( \theta \right) p \! \left( \theta \right) \dd \theta.
\end{equation*}
This is an important point that is worth repeating -- probability densities
are meaningless until they have been integrated over some event. To
analogize with physics, the event over which we integrate corresponds
to a \emph{volume} and the probability given by integrating the density
over such a volume corresponds to a \emph{mass}. When we want to
be careful to differentiate between probabilities and probability densities
we'll use \emph{probability mass} to refer to the former.
\subsection{Conditional Probability Density Functions}
Just as in the discrete case, probability density functions can be
immediately extended to implement conditional probability distributions
by simply adding a conditioning variable,
%
\begin{equation*}
p_{\Theta \mid \Phi} : \Theta \times \Phi \rightarrow \mathbb{R}^{+},
\end{equation*}
%
with conditional probabilities and conditional expectations reducing
to integrals,
%
\begin{align*}
\pi_{\Theta \mid \Phi} \! \left( E \mid \phi \right)
&=
\int_{E} p_{\Theta \mid \Phi} \! \left( \theta \mid \phi \right) \dd \theta,
\\
\mathbb{E}_{\pi_{\Theta \mid \Phi} } \! \left[ f \mid \phi \right]
&=
\int_{\Theta} f \! \left( \theta \right)
p_{\Theta \mid \Phi} \! \left( \theta \mid \phi \right) \dd \theta,
\end{align*}
Likewise, probability density functions representing joint and marginal
probability distributions are easy to construct for conditional probability
density functions. Joint probability density functions are given by a simple
multiplication,
%
\begin{equation*}
p_{\Theta \times \Phi} \! \left( \theta, \phi \right) =
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right) p_{\Phi} \! \left( \phi \right),
\end{equation*}
%
and marginal probability density functions are given by integrating out the
nuisance components,
%
\begin{align*}
p_{\Theta} \! \left( \theta \right)
&=
\int_{\Phi} p_{\Theta \times \Phi} \! \left( \theta, \phi \right) \dd \phi \\
&=
\int_{\Phi}
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right)
p_{\Phi} \! \left( \phi \right) \dd \phi.
\end{align*}
Bayes' Theorem also implies that any decomposition of a joint probability
density function is equal,
%
\begin{equation} \label{eqn:real_bayes}
p_{\Phi \mid \Theta} \! \left( \phi | \theta \right) p_{\Theta} \! \left( \theta \right)
=
p_{\Theta \times \Phi} \! \left( \theta, \phi \right)
=
p_{\Theta \mid \Phi} \! \left( \theta | \phi \right) p_{\Phi} \! \left( \phi \right),
\end{equation}
%
and from this one equality we can then recover the various implications of
Bayes' Theorem. For example,
%
\begin{equation*}
\pi_{\Theta \mid \Phi} =
\frac{ \dd \pi_{\Phi \mid \Theta} }{ \dd \pi_{\Phi} }
\pi_{\Theta}
\end{equation*}
%
becomes
%
\begin{equation*}
p_{\Theta \mid \Phi} \! \left( \theta \mid \phi \right) =
\frac{ p_{\Phi \mid \Theta} \! \left( \phi \mid \theta \right) }
{ p_{\Phi} \! \left( \phi \right) }
p_{\Theta} \! \left( \theta \right),
\end{equation*}
%
which is an immediate consequence of dividing both sides of the joint
equality \eqref{eqn:real_bayes} by $p_{\Phi} \! \left( \phi \right)$.
\subsection{Relating Probability Density Functions and Cumulative
Distribution Functions}
As with their discrete counterparts, probability density functions
and cumulative distribution functions can always be derived from
each other because they specify the same probability distribution.
Cumulative distribution functions, for example, are given by
integrating over probability density functions,
%
\begin{equation*}
P \! \left( \theta \right)
= \pi \! \left[ I \! \left( \theta \right) \right]
= \int_{\theta_{\min}}^{\theta} p \! \left( \theta' \right) \dd \theta.
\end{equation*}
Probability density functions, on the other hand, are given by
differentiating cumulative distribution functions,
%
\begin{equation*}
p \! \left( \theta \right) =
\frac{\partial P \! \left( \theta \right) }{ \partial \theta}.
\end{equation*}
%
Note that if we map to an equivalent measurable space with
$s : \Theta \rightarrow \Omega$, then the derivative
acquires a factor of the inverse Jacobian so that the
corresponding probability density function transforms
as necessary,
%
\begin{equation*}
p \! \left( \omega \right)
=
\frac{\partial P \! \left( \omega \right) }{ \partial \omega}
=
\frac{\partial P \! \left( s^{-1} \! \left( \omega \right) \right) }
{ \partial \theta}
\frac{ \partial \theta }{ \partial \omega }
=
p \! \left( s^{-1} \! \left( \omega \right) \right) \left| \mathbf{J}^{-1} \right|.
\end{equation*}
\subsection{Implementations of Mixed Probability Distributions}
Distributions over samples spaces that have both a discrete, $\Phi$,
and a real $\Psi$, component can be implemented by leveraging
discrete and continuous representations of conditional distributions.
For example, a distribution over $\Theta = \Phi \times \Psi$ can be
specified by conditioning the discrete component with the real
component, $\pi_{\Phi \mid \Psi}$, and providing a marginal distribution
over the real component, $\pi_{\Psi}$. Using a conditional
probability mass function for the former and a probability density function
for the latter, the probability of any event is given by
%
\begin{align*}
\pi_{\Theta} \! \left( E \right)
&=
\pi_{\Theta} \! \left( E_{\Phi} \times E_{\Psi} \right)
\\
&=
\EE_{\pi_{\Psi}} \! \left[
\pi_{\Phi \mid \Psi} \! \left( E_{\Phi} \mid \psi \right)
\cdot
\mathbb{I}_{E_{\Psi}} \! \left( \psi \right)
\right]
\\
&=
\int_{E_{\Psi} }
\sum_{\phi \in E_{\Phi} } p_{\Phi \mid \Psi} \! \left( \phi \mid \psi \right)
p_{\Psi} \! \left( \psi \right) \dd \psi,
\end{align*}
%
with expectations given similarly by
%
\begin{align*}
\EE_{\pi_{\Theta}} \! \left[ f \right]
&=
\EE_{\pi_{\Psi}} \! \left[
\EE_{\pi_{\Phi \mid \Psi}} \! \left[ f \mid \psi \right]
\right]
\\
&=
\int_{E_{\Psi} } \sum_{\phi \in E_{\Phi} } f \! \left( \phi, \psi \right)
p_{\Phi \mid \Psi} \! \left( \phi \mid \psi \right)
p_{\Psi} \! \left( \psi \right) \dd \psi.
\end{align*}
Equivalently, we could also condition the real component on the
discrete component, $\pi_{\Psi \mid \Phi}$ and provide a marginal
distribution over the discrete component, $\pi_{\Phi}$. We could
then represent the distribution with a conditional probability density
function for the former and a probability mass function for the latter.
Likewise, the probability of any event is given by
%
\begin{align*}
\pi_{\Theta} \! \left( E \right)
&=
\pi_{\Theta} \! \left( E_{\Phi} \times E_{\Psi} \right)
\\
&=
\EE_{\pi_{\Phi}} \! \left[
\pi_{\Psi \mid \Phi} \! \left( E_{\Psi} \mid \phi \right)
\cdot
\mathbb{I}_{E_{\Phi}} \! \left( \phi \right)
\right]
\\
&=
\sum_{\phi \in E_{\Phi} } \int_{E_{\Psi} }
p_{\Psi \mid \Phi} \! \left( \psi \mid \phi \right)
p_{\Phi} \! \left( \phi \right) \dd \psi,
\end{align*}
%
with expectations given similarly by
%
\begin{align*}
\EE_{\PP_{\Theta}} \! \left[ f \right]
&=
\EE_{\PP_{\Phi}} \! \left[
\EE_{\PP_{\Psi \mid \Phi}} \! \left[ f \mid \phi \right]
\right]
\\
&=
\sum_{\phi \in E_{\Phi} } \int_{E_{\Psi} }
f \! \left( \phi, \psi \right)
p_{\Psi \mid \Phi} \! \left( \psi \mid \phi \right)
p_{\Phi} \! \left( \phi \right) \dd \psi.
\end{align*}
Regardless of how we choose to decompose the mixed distribution,
combining probability density functions and probability mass functions
makes specifying and manipulating these distributions straightforward in
practice.
\section{Stochastic Implementations of Probability Distributions}
We can also specify any probability distributions and compute its
expectations \emph{stochastically}. A \emph{stochastic process} is
any mechanism that generates a sequence of states, or \emph{samples},
from a given sample space,
$\left\{ \theta_{1}, \ldots, \theta_{N} \right\} \subset \Theta$. In general,
the generation of the $n$-th state in the sequence, $\theta_{n}$, can
depend on all previous states in the sequence,
$\left\{ \theta_{1}, \ldots, \theta_{n - 1} \right\}$.
A stochastic process implements a given probability distribution if
the samples themselves can be used to recover all expectations as
the size of the sequence becomes infinitely large. More formally, if
%
\begin{equation*}
\lim_{N \rightarrow \infty} \frac{1}{N}
\sum_{n = 1}^{N} f \! \left( \theta_{n} \right)
= \EE_{\pi} \! \left[ f \right],
\end{equation*}
%
for all well-behaved functions, $f : \Theta \rightarrow \RR$, then the
stochastic process implements all computations with respect to the
probability distribution $\pi$.
This procedure, and hence the resulting samples, are \emph{exact}
if every state in the sequence is generated independently of any previous
states. In other words, samples are exact when the stochastic process
generates samples one at a time, with no dependence on the preceding
or following samples. If samples are not exact we refer to them as
\emph{correlated}.
Exact stochastic processes are, by construction, perfectly
\emph{random} processes: there is no way to predict any element
of the sampling sequence given the state of other samples in
that sequence. Unfortunately such randomness is impossible to
achieve in practice as computers are fundamentally deterministic.
Instead we rely on \emph{pseudorandom} processes which
generate sequences in a convoluted but ultimately deterministic
fashion. When we are ignorant of the precise configuration of a
pseudorandom process the resulting samples \emph{appear} to be
random for all intents and purposes.
Pseudorandom processes that target specific probability distributions,
or \emph{pseudorandom number generators}, can be devised for
many simple probability distributions. Unfortunately they are typically
impossible to construct for the more complex distributions that are
often of practical interest.