-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathBayesian Modeling Section I Distributions.Rmd
580 lines (422 loc) · 27.1 KB
/
Bayesian Modeling Section I Distributions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
---
title: "Introduction to Bayesian Modeling"
subtitle: "Section 1 - Distributions"
output: pdf_document
---
## Level-Set
In this section, we focus on distributions. In Section 2, we'll add equations *(e.g., regression)* and then explore multilevel models and pooling. This level assumes that you're familiar with basic probability *(including conditional probability)*, distributions, and parameter estimation using gradient descent and other derivative-based methods, linear algebra estimators, and maximum likelihood. Now we will look at parameter estimation using sampling.
Bayes Rule, stated in terms for parameters and data $P(\theta | Data) = \frac {P(\theta) * P(Data | \theta))}{P(Data)}$, works well for inverting probabilities in a closed system. In this case, the denominator $P(Data)$ is the marginal probability of the population. It serves as a normalizing **constant**. Recall how, in maximum likelihood, we can drop the constant portion of the binomial density function $\frac{n!}{n-h!}$ while varying $p$ for likelihood. We can likewise drop the normalizing constant here. Then, the equation becomes:
$P(\theta | Data) \propto P(\theta) * P(Data | \theta)$, or
$Posterior(\theta | Data) \propto Prior(\theta) * Likelihood(Data | \theta)$, where $\propto$ means "proportionate to". We can normalize later.
So, that's where we begin. Now, let's discuss a few overall concepts that will be helpful going forward:
* The model's **target** is $Posterior(\theta | Data)$, i.e., we want the **parameters** *(conditioned on the data)*. Once we have a $\theta$ vector of parameters, we can simulate, project, and make inferences about, the posterior. In the business transaction space, where volume and dynamics are high, we don't want a model doing prediction tasks - we just want the model to provide the parameters, and then we'll use more performant, extensible, interoperable programs to generate predictions, and share parameters with other programs.
* The $\theta$ values across the prior and the likelihood are seldom the same structure *(i.e, different distributions)*. And since these distribution are multiplied together to get the posterior, we usually don't know what the posterior is going to look like. We can guess about the distribution *(actually, we have to do that)*, but that's just a start. Then, we use the sampler to find the parameters for us.
* Posterior parameters often become priors in subsequent analyses *(note how the prior is also defined by parameters* $P(\theta)$, *so there's a natural transition)*. This is an important advantage of Bayesian modeling, as we'll see. Also note that parameters can number in hundreds, or thousands *(think of a multilevel model with a couple of levels and hundreds of groups in each level - each with a unique set of parameters)*.
Now, we're going to walk through three models with conjugate distributions *(distributions of the same family)*. This is a nice, clean scenario, and a good exercise for building intuition. We're also going to walk through direct calculation of posterior parameters with two of the models *(the last one - a skew normal doesn't have a closed form solution, so we can't do that)*. Then we'll use these directly determined posteriors as a baseline for comparing and understanding sampling.
We'll also parameter a distribution to get a feel for how that works - as this is a handy skill.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
First, load the following libraries:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
library(tidyverse)
library(rstan)
library(sn)
```
## I. Normal Distributions (Conjugate Prior)
Generate some data using the following parameters for a prior distribution. We'll also *normalize* the data by dividing the density values by the sum(density) *(this keeps all the distributions on the same scale and helps with some of the direct calculations we'll do - not necessary when using a sampler)*. Show the prior, likelihood:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
# set seed not necessary
set.seed(12)
priorMean <- 10
priorSigma <- 2
likeMean <- 5
likeSigma <- 2
x = seq(from = 0, to = 20, length.out = 100)
dfNorm <- data.frame(x=x, prior = dnorm(x, priorMean ,priorSigma))
dfNorm$prior <- dfNorm$prior/sum(dfNorm$prior) #normalize
p <- ggplot(data = dfNorm, aes(x, prior)) +
geom_line() +
theme(panel.background = element_rect(fill = "white")) +
ylab("Probability")
dfNorm$like <- dnorm(x, likeMean, likeSigma)
dfNorm$like <- dfNorm$like/sum(dfNorm$like) # normalize
p <- p + geom_line(data = dfNorm, aes(x, like), color = 'blue')
p
```
Now, let's create a posterior $P(\theta) * P(Data | \theta)$:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo=T}
dfNorm$post <- dfNorm$like*dfNorm$prior # multiply densities
dfNorm$post <- dfNorm$post/sum(dfNorm$post) # normalize
p <- p + geom_line(data = dfNorm, aes(x, post), color = 'red')
p
```
*We often say that the posterior is a compromise between the prior and likelihood. Can you see why?*
**So, what are the parameters of the posterior? Well, we don't know..**
The nice thing about **some** conjugate distributions is that we can compute parameter estimates directly *(you can't just use moments or parameter functions - there's no data - the posterior is imaginary - we only have density factors - look at the dfNorm table)*.
Here's how we can estimate parameters in the conjugate normal case:
$\sigma = \sqrt{\frac{1}{\sigma_1^{-2} + \sigma_2^{-2}}}$
$\mu = posterior \sigma^2 * \frac{\mu_1}{\frac{\sigma^2 + \mu_2}{\sigma_2^2}}$
Then we can compute our confidence intervals *(now that we have the parameters)*:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo=T}
postVar <- 1/(priorSigma^{-2}+likeSigma^{-2}) # Posterior var
ppostSigma <- sqrt(postVar) # Posterior sd
postMean <- postVar*(priorMean/priorSigma^2+likeMean/likeSigma^2) # Posterior mean
lowCI <- qnorm(0.025, postMean, ppostSigma) # Posterior 95% interval
highCI <- qnorm(0.975, postMean, ppostSigma)
```
Visually:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
p <- p + geom_errorbarh(aes(xmin = lowCI, xmax = highCI, y = 0), height = .005, color = "red")
p
```
Now we're going to do a similar estimate using Stan *( https://mc-stan.org/ - please download the user manual, etc. you'll need it)*. The following text is our Stan model to sample this posterior:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
stanMod <- '
data {
int N; // number of observations
vector[N] y; // likelihood data
real priorMu;
real<lower=0> priorSigma;
}
parameters {
real mu; // posterior mean
real<lower=0> sigma; // posterior sigma
}
model {
target += normal_lpdf(y | mu, sigma); // likelihood data
target += normal_lpdf(mu | priorMu, .2); // prior Mu
target += normal_lpdf(sigma | priorSigma, .001); // prior sigma
}
'
```
This Stan model has 3 sections data, parameters, and model *(there are more sections possible, which we'll cover later)*:
* **data**. The data section is where we set up the variables and datatypes, which are passed into the model from the stan function call *(below)*. Vectors are assumed to be numeric *(otherwise, we define the type explicitly - e.g., int x[N];)*.
* **parameters**. The parameters section is where we define the parameters we want the sampler to track - we'll get an estimate for each one. When defining scale parameters, be sure and use "real<lower=0> sigma" syntax, because a scale parameter can't be negative - otherwise you give the sampler permission to wander around looking for a negative sigmas, and that's not going to end well *(samplers are drunk)*!
* **model**. The model section is where we tell the sampler how it all fits together. To get a parameter into the sampling process, we add it to the target log probability density function *(using target += )*, and define the distribution *(e.g., using normal_lpdf for normal log density - there are a TON of options for distributions - RTFM)*. Looking at the statements in the model section:
>target += normal_lpdf(y | mu, sigma); This statement fits a normal distribution to the likelihood data (y). y is conditioned on mu and sigma, and from there draws samples into the posterior target.
>target += normal_lpdf(mu | priorMu, .2); fits the mu, conditioned on the priorMu (which we've set to 10). The sigma of the priorMu is set to .2, which expresses confidence in that prior (this allows the sampler to move the prior between 9.6 and 10.4 (+- 2 se's for 95%) as it tries to fit the posterior parameters), which forces the posterior to settle ~ 7.5.
>target += normal_lpdf(sigma | priorSigma, .001); fits the sigma the same way. The sigma of the sigma (yes, I said that) is set to a draconian .001, so it can only move the sigma between 1.998 and 2.002 (more discipline, less bondage! "High Anxiety").
---
### - Elaboration and Application
One element about the model that may seem new is the definition of the variance of the prior parameters. For example, in the statement: target += normal_lpdf(mu | priorMu, .2), we're judgmentally defining the std deviation of priorMu to be .2 *(keep in mind this is not the variance of the data, it's the variance of the parameter)*. This represents the confidence that we have in the Prior - the lower the $\sigma$, the more confidence we have in the priorMu. If we increase $\sigma$, the sampler is allowed to wander more. What direction? It will wander towards the data *(likelihood)*, i.e, $y$. This is as it should be - if we don't have confidence in our priors, the model trusts the data. This is a Bayesian thing, and a powerful capability - esp. in business. Consider these scenarios:
>Let's say the we're planning profit, production and inventory, and we make heavy equipment with the main components made from steel. We meet with the global sourcing staff and look at the last 2 years of data (likelihood). Several of the procurement managers believe there will be capacity limits in the EU, and a couple of managers also think there will be a global shortage, with an increase in prices. It seems obvious that, in this scenario, we need to consider the possibilities discussed, and we can do that by conditioning the data with prior estimates. As data rolls in, we can refine those priors, but we can't wait until then to build a plan - we have to start production.
>There is a similar scenario regrading sales of new products different markets that was posted in the Repository introduction. These scenarios are about forecasting, which eventually end up in plans and budgets. Don't get the idea that forecasts are just some doodles in a spreadsheet - they are the basis of major commitments and budgets (operating and capital), and have to be approved by the Board of Directors (as mandated in the Corporate Charter). And for assurance and audit professionals, these models explain the relationships between external drivers and business results - often they provide the best evidence for opinions.
These sigmas also give us the ability to **generalize** and **pool** specific parameters. These are powerful tools in the Bayesian kit. If you're in this course, you've hopefully studied generalization. In the machine learning world, L1 and L2 generalization are globally applied. That's just not adequate for mulitlevel enterprise scenarios. And these multiple levels will cluster in effects *(for example, in the scenario above, we might expect the effect of steel prices to be greater in EU locations than in Americas)*
---
So now we run the model. Stan will show progress: *(this is the only time I'll do this in a document, but it's a good practice to monitor sampler output on all models)*.
```{r, message=F, warning=F, fig.width=3, fig.height=3, fig.align="center", echo = T}
N = nrow(dfNorm)
# generate likelihood data
x = seq(from = 0, to = 20, length.out = N)
y <- rnorm(x, mean = likeMean, sd = likeSigma)
fit <- stan(model_code = stanMod, data = list(
y= y,
N = N,
priorMu = priorMean,
priorSigma = priorSigma)
)
```
The output shows the progress of 4 sampling chains *(Stan sets 4 chains with 2k iterations by default)*, and the parameters. A visual representation of the sampling can be created with a trace plot. Compare this with the description of how the sampler finds parameters above *(and you can see how our choice of prior sigmas constrains the sampler)*:
```{r, message=F, warning=F, fig.width=6, fig.height=2, fig.align="center"}
# plot of sampling
stan_trace(fit, inc_warmup = TRUE)
```
And now we can get what we're after:
```{r, message=F, warning=F, fig.width=3, fig.height=3, fig.align="center", echo = T}
fit
```
Note: we usually get the output of the model by using a summary() function on the model fit, as follows *(there are ways to tune this as we'll see later)*.
```{r, message=F, warning=F, fig.width=3, fig.height=3, fig.align="center", echo = T}
# get parameters and plot mu on model
sumFit <- summary(fit)
postMu <- sumFit$summary[1,1]
postSigma <- sumFit$summary[2,1]
```
And we can add our estimated $\mu$ to the previous posterior for comparison:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
dfNormStan = data.frame(x = x, density = dnorm(x, mean = postMu, sd = postSigma ))
dfNormStan$density = dfNormStan$density/sum(dfNormStan$density) #normalize
p <- p + geom_line(data = dfNormStan, aes(x, density), color = 'green')
p
```
and the following is a plot of the fit *(remember, this is the posterior parameters, not the prior)*:
```{r, message=F, warning=F, fig.width=4, fig.height=2, fig.align="center", echo = T}
stan_hist(fit)
```
Again, we generally transfer the parameter values to a dataframe where we can manage them - there can be a LOT!
The most comprehensive tool for diagnosing sampling is shinystan. I don't run this in a markdown document, as it stops the knit function, but please uncomment and run from an r file *(it will open up your browser, and r execution will halt until you shut it down)*
```{r, message=F, warning=F, fig.width=3, fig.height=3, fig.align="center", echo = T}
# library(shinystan)
# launch_shinystan(fit)
# don't launch in a rmd
```
---
#### Perspective
Sampling is a complex and fragile process. We're treating it as a "black box" here, but you will have to debug sampling, which means you'll need to have some intuition, built on hands-on experience and reading everything you can find. Diagnosis level expertise is art and science and requires commitment. This is No BS territory.
---
## II. Beta Binomial Distributions (Conjugate Prior)
If you've worked with logistic regression, you're familiar with a binomial distribution. In this example, we'll generate likelihood data assuming 8 successes out of 20 trials *(so a mean of .4 for a continuous probability distribution)*:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
h <- 8
# n number of trials
n <- 20
# p probability of success
grid <- seq(from = .01, to = 1, by = .01)
dfData <- data.frame(h, n, grid)
dfData$like <- dbinom(h, size = n, prob = grid)
dfData$like <- dfData$like/sum(dfData$like) # normalize
p1 <- ggplot(dfData, aes(x = grid, y = like)) +
geom_line() +
ylab("Probability") +
theme(panel.background = element_rect(fill = "white"))
p1
```
### Reparameterization: Binomial => Beta
The Binomial distribution can be **reparameterized** as a Beta distribution, which like the binomial, represents a density of *probabilities (it can also be reparameterized as a Normal)*
---
##### Diversion to discuss Beta distributions
>We're going to break for a minute to talk about the advantages of the Beta distribution.The classic illustration is baseball batters (recommend David Robinson's "Introduction to Empirical Bayes" - very approachable read). Let's say a batter has 300 AB's and gets 80 hits, so he's batting ~ 267 and our expectation for batting average is in the range of .2-.35 (this could be a prior based on a century of experience). All we need to tell the density function is the successes and failures:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
a = 80
b = 220
bgrid = seq(from = .2, to = .35, length.out = 100)
dfBeta <- data.frame(x <- bgrid, db = dbeta(bgrid, a, b))
# note to ellen - go back and check earlier beta pdf for consistency
dfBeta$db <- dfBeta$db/sum(dfBeta$db) # normalize
p1 <- ggplot(data = dfBeta, aes(x, db)) +
geom_line() +
ylab("Probability") +
theme(panel.background = element_rect(fill = "white"))
p1
```
>Now let's say he gets up to bat 200 more times and gets 60 hits. Updating the parameters is really easy, we just add:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
a = (80 + 60)
b = (220 + 140)
dfBeta2 <- data.frame(x <- bgrid, db = dbeta(bgrid, a, b))
dfBeta2$db <- dfBeta2$db/sum(dfBeta2$db) # standardize
p1 <- p1 +
geom_line(data = dfBeta2, aes(x, db), color = "blue") +
ylab("Probability") +
theme(panel.background = element_rect(fill = "white"))
p1
```
>How easy is that? Also notice that the scale has decreased slightly. Why? Because we have more data (Doh!).
---
OK, back to Bayesian Modeling. We're going to reparameterize the Binomial as a Beta *(because it's easier to work with)*, using the following formulas:
$\alpha = h+1$ *(remember h is number of successes in binomial)*
$\beta = n-h+1$ *(and n is number of trials in binomial)*
So:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
a = h+1
b = (n-h+1)
dfBeta <- data.frame(x <- grid, db = dbeta(grid, a, b))
# note to ellen - go back and check earlier beta pdf for consistency
dfBeta$db <- dfBeta$db/sum(dfBeta$db) # standardize
p1 <- ggplot(data = dfBeta, aes(x, db)) +
geom_line() +
ylab("Probability") +
theme(panel.background = element_rect(fill = "white"))
p1
```
This time, we'll assume that we don't have any experience to add to our analysis - so the likelihood data is all we know. In this case, in the Bayesian world, we use a *non-informative*, or *uniform* prior *(in the Beta distribution with two parameters, a flat distribution is where a and b, are 1)*:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
# now we'll set up a uniform prior (more on uniform priors later)
aPrior <- 1
bPrior <- 1
prior <- dbeta(grid,aPrior,bPrior)
dfBeta$prior <- prior/sum(prior) #standardize
p1 <- p1 +
geom_line(data = dfBeta, aes(x, prior), color = "blue") +
ylab("Probability")
p1
```
Now, we'll estimate the posterior parameters using a direct method *(like above)*:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
post <- dfData$like*prior
dfBeta$post <- post/sum(post)
p1 <- p1 +
geom_line(data = dfBeta, aes(x, post), color = 'red') +
ylab("Probability")
p1
```
So if we use a uniform prior, the posterior reflects the likelihood data *(should make sense)*.
Now, let's change the prior as see what happens *(lets set success rate at .5 in the prior)*:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
a <- h+1
b <- h+1 # setting success and failures equal, so 50% prob
prior <- dbeta(grid,a,b)
dfBeta$prior <- prior/sum(prior) #normalize
p2 <- ggplot(data = dfBeta, aes(x, prior)) +
geom_line(color = "blue") + # prior
geom_line(aes(x, db)) +
ylab("Probability") +
theme(panel.background = element_rect(fill = "white"))
```
Now let's see how that affects the posterior"
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
# now recalculate a posterior
post <- dfData$like*prior
dfBeta$post <- post/sum(post) #normalize
# and redraw
p2 <- p2 +
geom_line(data = dfBeta, aes(x, post), color = "red")
p2
```
And here's how we estimate the parameters using direct calculation:
$Posterior Beta Parameters = [(h+a), (n-h+b)]$
*(note that the h and n came from the likelihood data, and the a and b came from the prior - different distributions that can easily be combined and updated in the Beta distribution - see why we reparameterized?)*:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo = T}
# EXACT CALCUATION -------#
dfBeta$post2 <- dbeta(grid,h+a,n-h+b)
dfBeta$post2 <- dfBeta$post2/sum(dfBeta$post2) #normalize
p2 <- p2 +
geom_line(data = dfBeta, aes(x, post2), linetype = "dotted", color = "black") +
ylab("Probability")
p2
```
And calculating the confidence interval:
```{r, message=F, warning=F, fig.width=2, fig.height=2, fig.align="center", echo=T}
# extending calculation
(h+a)/(n+a+b) # Posterior mean
qbeta(c(0.05,0.95),h+a,n-h+b) # Posterior 90% interval
```
Now, like before, we'll model this in Stan. Notice that the distribution has changed to a beta.
```{r, message=F, warning=F, fig.width=7, fig.height=3, fig.align="center", echo=T}
# The Stan model as a string.
model_string <- "
// Bernoulli model
data {
int<lower=0> n; // number of observations
int<lower=0> h; // likelihood (successes)
int<lower=0> priorA;
int<lower=0> priorB;
}
parameters {
real<lower=0,upper=1> theta;
}
model {
target += beta_lpdf(theta | priorA, priorB);
target += binomial_lpmf(h | n, theta);
}
"
d_bin <- list(
n = n,
h = h,
priorA = a,
priorB = b)
# fit_bin <- stan(model_code = model_string, data = d_bin)
fit_bin <- stan(model_code = model_string, data = d_bin, refresh = 0) # doesn't show ouput
print(fit_bin,digits=2)
sumFit <- summary(fit_bin)
mu <- sumFit$summary[1,1]
sd <- sumFit$summary[1,3] # this is NOT the sd for the posterior (we didn't estimate that in the model)
```
Looking at the model section *(switching order here for discussion purposes)*
>target += binomial_lpmf(h | n, theta); The theta is the probability parameter for the likelihood data which has n of 20 and h of 8. Those are the data - the likelihood - so p wants to go to .4.
>target += beta_lpdf(theta | priorA, priorB); But theta has priors that must be rationalized within a beta distribution. The PiorA is 9 and the PriorB is 9, so the prior wants to pull towards .5. The sampler finds the balance
## III. Skew Normal
The Skew Normal doesn't have a closed form solution, so we won't be able to get exact calculations for comparison *(which will be the case in 99% of your work in business scenarios)*.
```{r, message=F, warning=F, fig.width=3, fig.height=2, fig.align="center", echo=T}
grid = seq(from = 0, to = 10, length.out = 100)
dfData <- data.frame(grid = grid, like = dsn(grid, 1, 3, 10))
dfData$like <- dfData$like/sum(dfData$like)
p3 <- ggplot(data = dfData) +
geom_line(aes(grid, like)) +
ylab("Probability") +
theme(panel.background = element_rect(fill = "white"))
p3
```
Setting the priors:
```{r, message=F, warning=F, fig.width=3, fig.height=2, fig.align="center", echo=T}
priorLocat <- 2
priorScale <- 4
priorSkew <- 9
dfData$prior <- dsn(grid, priorLocat, priorScale, priorSkew)
dfData$prior <- dfData$prior/sum(dfData$prior)
p3 <- p3 +
geom_line(data = dfData, aes(grid, prior), color = "blue") +
theme(panel.background = element_rect(fill = "white"))
p3
```
Even though there's no closed form solution, we're still working with $P(\theta | Data) \propto P(\theta) * P(Data | \theta)$. To the sampler, it really doesn't matter *(although it does affect computation times)*
```{r, message=F, warning=F, fig.width=3, fig.height=2, fig.align="center", echo=T}
dfData$post <- dfData$like*dfData$prior
dfData$post <- dfData$post/sum(dfData$post)
# there is no closed form solution to SN distributions
# we can estimate using likelihood * prior
p3 <- p3 +
geom_line(data = dfData, aes(grid, post), color = "red") +
theme(panel.background = element_rect(fill = "white"))
p3
```
Now for Stan
```{r, message=F, warning=F, fig.width=8, fig.height=8, fig.align="center", echo=T}
stanMod <- '
data {
int N;
vector[N] y;
real xi_mu;
real<lower=0> xi_sigma;
real omega_mu;
real<lower=0> omega_sigma;
real alpha_mu;
real<lower=0> alpha_sigma;
}
parameters {
real xi;
real<lower=0> omega;
real alpha;
}
model {
target += skew_normal_lpdf(y | xi, omega, alpha);
target += normal_lpdf(xi | xi_mu, xi_sigma);
target += normal_lpdf(omega | omega_mu, omega_sigma);
target += normal_lpdf(alpha | alpha_mu, alpha_sigma);
}
'
locat <- 1
scale <- 3
skew <- 10
N = 100
y <- rsn( N , xi= locat, omega= scale , alpha= skew ) # using data sample now
mod <- stan_model(model_code = stanMod)
priorLocat <- 2
priorScale <- 4
priorSkew <- 9
fit <- stan(model_code = stanMod,
data = list(y = y,
xi_mu = priorLocat,
xi_sigma = .5,
omega_mu = priorScale,
omega_sigma = .5,
alpha_mu = priorSkew,
alpha_sigma = .5,
N = 100), refresh = 0)
print(fit)
```
The model here is pretty straightforward:
>target += skew_normal_lpdf(y | xi, omega, alpha); The likelihood data (y) is conditioned on three parameters xi, omega and alpha, which is 1, 3, and 10.
>target += normal_lpdf(xi | xi_mu, xi_sigma); xi of 1 is conditioned on a prior of 2 with a xi_sigma of .5
>target += normal_lpdf(omega | omega_mu, omega_sigma); omega of 3 is conditioned on a prior of 4 with a omega_sigma of .5
>target += normal_lpdf(alpha | alpha_mu, alpha_sigma); alpha of 10 is conditioned on a prior of 9 with a alpha_sigma of .5
```{r, message=F, warning=F, fig.width=3, fig.height=2, fig.align="center", echo=T}
sumFit <- summary(fit)
postLocat <- sumFit$summary[1,1]
postScale <- sumFit$summary[2,1]
postSkew <- sumFit$summary[3,1]
dfData$post2 <- dsn(grid, postLocat, postScale, postSkew)
dfData$post2 <- dfData$post2/sum(dfData$post2)
p3 <- p3 +
geom_line(data = dfData, aes(grid, post2), color = "green") +
theme(panel.background = element_rect(fill = "white"))
p3
```
### Take-Aways
The Bayesian modeling process is really quite natural. We collect data for a current event *(likelihood)*, and many times we have prior knowledge that can, and should influence our analysis of that current event *(the batter example is classic - what if the batter got a hit his first 10 times up - is he likely to bat 1.0?)*. This results in a posterior that balances evidence and experience to produce a posterior. Now for inference: if we don't have distributions, we don't have probability, if we don't have probability, we can't make an inference. So, we need distributions. We've looked at 3-4 distributions here, but there are many, many others *(RTFM)*. The art and science of fitting distributions and setting parameters is where experience and intuition come in.
Bayesian analysis is an established tool and most Data Science groups will expect you to apply Bayesian analysis and multilevel, pooled models as a default approach in business analysis. That's where you start. Machine learning is really a tool for tuning Bayesian models. We'll look at equations *(within the Stan model)* next.
## Homework
1. Create an R file that walks through the 3 scenarios, but use your own parameters *(different from mine - no copy and paste - THINK through it, try different values and see how it affects the outcome)*
2. Include conclusions and comments in the file
3. Make sure it runs end-to-end. I'm not debugging - no run, no credit.