---
engine: knitr
---
# Linear models {#sec-its-just-a-linear-model}
**Prerequisites**
- Read *Regression and Other Stories*, [@gelmanhillvehtari2020]
  - Focus on Chapters 6 "Background on regression modeling", 7 "Linear regression with a single predictor", and 10 "Linear regression with multiple predictors", which provide a detailed guide to linear models.
- Read *An Introduction to Statistical Learning with Applications in R*, [@islr]
  - Focus on Chapter 3 "Linear Regression", which provides a complementary treatment of linear models from a different perspective.
- Read *Why most published research findings are false*, [@ioannidis2005most]
  - Details aspects that can undermine the conclusions drawn from statistical models.
**Key concepts and skills**
- Linear models are a key component of statistical inference and enable us to quickly explore a wide range of data.
- Simple and multiple linear regression model a continuous outcome variable as a function of one, and multiple, predictor variables, respectively.
- Linear models tend to focus on either inference or prediction.
**Software and packages**
- Base R [@citeR]
- `beepr` [@beepr]
- `broom` [@broom]
- `broom.mixed` [@mixedbroom]
- `modelsummary` [@citemodelsummary]
- `purrr` [@citepurrr]
- `rstanarm` [@citerstanarm]
- `testthat` [@testthat]
- `tictoc` [@Izrailev2014]
- `tidyverse` [@tidyverse]
- `tinytable` [@tinytable]
```{r}
#| message: false
#| warning: false
library(beepr)
library(broom)
library(broom.mixed)
library(modelsummary)
library(purrr)
library(rstanarm)
library(testthat)
library(tictoc)
library(tidyverse)
library(tinytable)
```
## Introduction
Linear models\index{linear models} have been used in various forms for a long time. @stigler [p.16] describes how least squares,\index{least squares} which is a method to fit simple linear regression, was associated with foundational problems in astronomy\index{astronomy} in the 1700s, such as determining the motion of the moon and reconciling the non-periodic motion of Jupiter and Saturn.\index{statistics!history of} The fundamental issue at the time with least squares was that of hesitancy by those coming from a statistical background to combine different observations. Astronomers were early to develop a comfort with doing this, possibly because they had typically gathered their observations themselves and knew that the conditions of the data gathering were similar, even though the value of the observation was different. For instance, @stigler [p.28] characterizes Leonhard Euler, the eighteenth century mathematician mentioned in @sec-farm-data, as considering that errors increase as they are aggregated, in comparison to Tobias Mayer, the eighteenth century astronomer, who was comfortable that errors would cancel each other. It took longer, again, for social scientists to become comfortable with linear models, possibly because they were hesitant to group together data they worried were not alike. In one sense astronomers had an advantage because they could compare their predictions with what happened whereas this was more difficult for social scientists [@stigler, p. 163].\index{linear models!errors}
When we build models, we are not discovering "the truth".\index{linear models!not the truth} A model is not, and cannot be, a true representation of reality. We are using the model to help us explore and understand our data. There is no one best model, there are just useful models that help us learn something about the data that we have and hence, hopefully, something about the world from which the data were generated. When we use models, we are trying to understand the world, but there are constraints on the perspective we bring to this. We should not just throw data into a model and hope that it will sort it out. It will not.
> Regression is indeed an oracle, but a cruel one. It speaks in riddles and delights in punishing us for asking bad questions.
>
> @citemcelreath [p. 162]
We use models to understand the world.\index{linear models!appropriate use} We poke, push, and test them. We build them and rejoice in their beauty, and then seek to understand their limits and ultimately destroy them. It is this process that is important, it is this process that allows us to better understand the world; not the outcome, although they may be coincident. When we build models, we need to keep in mind both the world of the model and the broader world that we want to be able to speak about. The datasets that we have are often unrepresentative of real-world populations in certain ways. Models trained on such data are not worthless, but they are also not unimpeachable. To what extent does the model teach us about the data that we have? To what extent do the data that we have reflect the world about which we would like to draw conclusions? We need to keep such questions front of mind.
A lot of statistical methods commonly used today were developed for situations such as astronomy\index{astronomy} and agriculture.\index{statistics!history of} Ronald Fisher, introduced in @sec-hunt-data, published @fisherresearchworkers while he worked at an agricultural research institution. \index{Fisher, Ronald!randomization} But many of the subsequent uses in the twentieth and twenty-first centuries concern applications that may have different properties. Statistical validity relies on assumptions, and so while what is taught is correct, our circumstances may not meet the starting criteria. Statistics is often taught as though it proceeds through some idealized process where a hypothesis appears, is tested against some data that similarly appears, and is either confirmed or not.\index{statistics} But that is not what happens, in practice. We react to incentives. We dabble, guess, and test, and then follow our intuition, backfilling as we need. All of this is fine. But it is not a world in which a traditional null hypothesis, which we will get to later, holds completely. This means concepts such as p-values and power lose some of their meaning. While we need to understand these foundations, we also need to be sophisticated enough to know when we need to move away from them.
Statistical checks are widely used in modeling. And an extensive suite of them is available. But automated testing of code and data is also important. For instance, @wakefieldestimates built a model of excess deaths in a variety of countries, to estimate the overall death toll from the pandemic. After initially releasing the model, which had been extensively manually checked for statistical issues and reasonableness, some of the results were reexamined and it was found that the estimates for Germany and Sweden were oversensitive. The authors quickly addressed the issue, but the integration of automated testing of expected values for the coefficients, in addition to the usual manual statistical checks, would go some way to enabling us to have more faith in the models of others.
::: {.content-visible when-format="pdf"}
In this chapter we begin with simple linear regression, and then move to multiple linear regression, the difference being the number of explanatory variables that we allow. We go through two approaches for each of these: base R, in particular the `lm()` and `glm()` functions, which are useful when we want to quickly use the models in EDA; and `rstanarm` for when we are interested in inference. In general, a model is either optimized for inference or prediction. A focus on prediction is one of the hallmarks of machine learning. For historical reasons that has tended to be dominated by Python, although `tidymodels` [@citeTidymodels] has been developed in R. Due to the need to also introduce Python, we devote the ["Prediction" Online Appendix](https://tellingstorieswithdata.com/27-prediction.html) to various approaches focused on prediction. Regardless of the approach we use, the important thing to remember is that we are just doing something akin to fancy averaging, and our results always reflect the biases and idiosyncrasies of the dataset.
:::
::: {.content-visible unless-format="pdf"}
In this chapter we begin with simple linear regression, and then move to multiple linear regression, the difference being the number of explanatory variables that we allow. We go through two approaches for each of these: base R, in particular the `lm()` and `glm()` functions, which are useful when we want to quickly use the models in EDA; and `rstanarm` for when we are interested in inference. In general, a model is either optimized for inference or prediction. A focus on prediction is one of the hallmarks of machine learning. For historical reasons that has tended to be dominated by Python, although `tidymodels` [@citeTidymodels] has been developed in R. Due to the need to also introduce Python, we devote [Online Appendix -@sec-predictingpythons] to various approaches focused on prediction. Regardless of the approach we use, the important thing to remember is that we are just doing something akin to fancy averaging, and our results always reflect the biases and idiosyncrasies of the dataset.
:::
Finally a note on terminology and notation. For historical and context-specific reasons there are a variety of terms used to describe the same idea across the literature. We follow @gelmanhillvehtari2020 and use the terms "outcome" and "predictor", we follow the frequentist notation of @islr, and the Bayesian model specification of @citemcelreath.
## Simple linear regression
When we are interested in the relationship of some continuous outcome variable, say $y$, and some predictor variable, say $x$, we can use simple linear regression.\index{simple linear regression} This is based on the Normal, also called "Gaussian", distribution, but it is not these variables themselves that are normally distributed.\index{distribution!Normal} The Normal distribution is determined by two parameters, the mean, $\mu$, and the standard deviation, $\sigma$ [@pitman, p. 94]:
$$y = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}z^2},$$
where $z = (x - \mu)/\sigma$ is the difference between $x$ and the mean, scaled by the standard deviation. @Altman1995 provide an overview of the Normal distribution.\index{distribution!Normal}
::: {.content-visible when-format="pdf"}
As introduced in the ["R essentials" Online Appendix](https://tellingstorieswithdata.com/20-r_essentials.html), we use `rnorm()` to simulate data from the Normal distribution.
:::
::: {.content-visible unless-format="pdf"}
As introduced in [Appendix -@sec-r-essentials], we use `rnorm()` to simulate data from the Normal distribution.
:::
```{r}
#| message: false
#| warning: false
set.seed(853)
normal_example <-
tibble(draws = rnorm(n = 20, mean = 0, sd = 1))
normal_example |> pull(draws)
```
Here we specified 20 draws from a Normal distribution with a true mean, $\mu$, of zero and a true standard deviation, $\sigma$, of one.\index{distribution!Normal} When we deal with real data, we will not know the true value of these, and we want to use our data to estimate them. We can create an estimate of the mean, $\hat{\mu}$, and an estimate of the standard deviation, $\hat{\sigma}$, with the following estimators:
$$
\begin{aligned}
\hat{\mu} &= \frac{1}{n} \times \sum_{i = 1}^{n}x_i\\
\hat{\sigma} &= \sqrt{\frac{1}{n-1} \times \sum_{i = 1}^{n}\left(x_i - \hat{\mu}\right)^2}
\end{aligned}
$$
If $\hat{\sigma}$ is the estimate of the standard deviation, then the standard error (SE) of the estimate of the mean, $\hat{\mu}$, is:
$$\mbox{SE}(\hat{\mu}) = \frac{\hat{\sigma}}{\sqrt{n}}.$$
The standard error is a comment about the estimate of the mean compared with the actual mean, while the standard deviation is a comment about how widely distributed the data are.\index{standard error}\index{standard deviation}^[Given the small sample size in the example of 20, we strictly should use the finite-sample adjustment, but as this is not the focus of this book we will move forward with the general approach.]
We can implement these in code using our simulated data to see how close our estimates are.
```{r}
#| echo: true
#| eval: true
#| label: tbl-meanstdexample
#| tbl-cap: "Estimates of the mean and standard deviation based on the simulated data"
estimated_mean <-
sum(normal_example$draws) / nrow(normal_example)
normal_example <-
normal_example |>
mutate(diff_square = (draws - estimated_mean) ^ 2)
estimated_standard_deviation <-
sqrt(sum(normal_example$diff_square) / (nrow(normal_example) - 1))
estimated_standard_error <-
estimated_standard_deviation / sqrt(nrow(normal_example))
tibble(mean = estimated_mean,
sd = estimated_standard_deviation,
se = estimated_standard_error) |>
tt() |>
style_tt(j = 1:3, align = "lrr") |>
format_tt(digits = 2, num_mark_big = ",", num_fmt = "decimal") |>
setNames(c(
"Estimated mean",
"Estimated standard deviation",
"Estimated standard error"
))
```
We should not be too worried that our estimates are slightly off (@tbl-meanstdexample), relative to the "true" mean and SD of 0 and 1. We only considered 20 observations. It will typically take a larger number of draws before we get the expected shape, and our estimated parameters get close to the actual parameters, but it will almost surely happen (@fig-normaldistributiontakingshape). @wasserman [p. 76] considers our certainty of this, which is due to the Law of Large Numbers,\index{Law of Large Numbers} as a crowning accomplishment of probability, although @wood [p. 15], perhaps more prosaically, describes it as "almost" a statement "of the obvious"!\index{distribution!Normal}
```{r}
#| eval: true
#| fig-cap: "The Normal distribution takes its familiar shape as the number of draws increases"
#| include: true
#| label: fig-normaldistributiontakingshape
#| message: false
#| warning: false
set.seed(853)
normal_takes_shapes <-
map_dfr(c(2, 5, 10, 50, 100, 500, 1000, 10000, 100000),
~ tibble(
number_draws = rep(paste(.x, "draws"), .x),
draws = rnorm(.x, mean = 0, sd = 1)
))
normal_takes_shapes |>
mutate(number_draws = as_factor(number_draws)) |>
ggplot(aes(x = draws)) +
geom_density() +
theme_minimal() +
facet_wrap(vars(number_draws),
scales = "free_y") +
labs(x = "Draw",
y = "Density")
```
When we use simple linear regression, we assume that our relationship is characterized by the variables and the parameters. If we have two variables, $Y$ and $X$, then we could characterize a linear relationship between these as:
$$
Y = \beta_0 + \beta_1 X + \epsilon.
$$ {#eq-xandy}
Here, there are two parameters, also referred to as coefficients: the intercept, $\beta_0$, and the slope, $\beta_1$. In @eq-xandy we are saying that the expected value of $Y$ is $\beta_0$ when $X$ is 0, and that the expected value of $Y$ will change by $\beta_1$ units for every one unit change in $X$. We may then take this relationship to the data that we have to estimate these parameters. The $\epsilon$ is noise and accounts for deviations away from this relationship. It is this noise that we generally assume to be normally distributed, and it is this that leads to $Y|X \sim \mbox{Normal}\left(\beta_0 + \beta_1 X, \sigma^2\right)$.
### Simulated example: running times
To make this example concrete, we revisit an example from @sec-clean-and-prepare about the time it takes someone to run five kilometers, compared with the time it takes them to run a marathon (@fig-fivekmvsmarathon-1).\index{simulation!running times} We specified a relationship of 8.4, as that is roughly the ratio between a five-kilometer run and the 42.2-kilometer distance of a marathon. To help the reader, we include the simulation code again here. Notice that it is the noise that is normally distributed, not the variables.\index{distribution!Normal} We do not require the variables themselves to be normally distributed in order to use linear regression.
```{r}
#| eval: true
#| include: true
#| label: fig-simulatemarathondata
#| message: false
#| warning: false
set.seed(853)
num_observations <- 200
expected_relationship <- 8.4
fast_time <- 15
good_time <- 30
sim_run_data <-
tibble(
five_km_time =
runif(n = num_observations, min = fast_time, max = good_time),
noise = rnorm(n = num_observations, mean = 0, sd = 20),
marathon_time = five_km_time * expected_relationship + noise
) |>
mutate(
five_km_time = round(x = five_km_time, digits = 1),
marathon_time = round(x = marathon_time, digits = 1)
) |>
select(-noise)
```
```{r}
#| echo: true
#| eval: true
#| message: false
#| warning: false
#| label: fig-fivekmvsmarathon
#| fig-cap: "Simulated data of the relationship between the time to run five kilometers and a marathon"
#| fig-subcap: ["Distribution of simulated data", "With one linear best-fit line illustrating the implied relationship", "Including standard errors"]
#| layout-ncol: 2
base_plot <-
sim_run_data |>
ggplot(aes(x = five_km_time, y = marathon_time)) +
geom_point(alpha = 0.5) +
labs(
x = "Five-kilometer time (minutes)",
y = "Marathon time (minutes)"
) +
theme_classic()
# Panel (a)
base_plot
# Panel (b)
base_plot +
geom_smooth(
method = "lm",
se = FALSE,
color = "black",
linetype = "dashed",
formula = "y ~ x"
)
# Panel (c)
base_plot +
geom_smooth(
method = "lm",
se = TRUE,
color = "black",
linetype = "dashed",
formula = "y ~ x"
)
```
In this simulated example, we know the true values of $\beta_0$ and $\beta_1$, which are zero and 8.4, respectively. But our challenge is to see if we can use only the data, and simple linear regression, to recover them. That is, can we use $x$, which is the five-kilometer time, to produce estimates of $y$, which is the marathon time, and that we denote by $\hat{y}$ (by convention, hats are used to indicate that these are, or will be, estimated values) [@islr, p. 61]:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$
This involves estimating values for $\beta_0$ and $\beta_1$. But how should we estimate these coefficients? Even if we impose a linear relationship there are many options because many straight lines could be drawn. But some of those lines would fit the data better than others.
One way we could define a line as being "better" than another is if it is as close as possible to each of the known $x$ and $y$ combinations. There are a lot of candidates for how we define "as close as possible", but one is to minimize the residual sum of squares.\index{residual sum of squares} To do this we produce estimates for $\hat{y}$ based on some guesses of $\hat{\beta}_0$ and $\hat{\beta}_1$, given the $x$. We then work out how wrong, for every observation $i$, we were [@islr, p. 62]:
$$e_i = y_i - \hat{y}_i.$$
To compute the residual sum of squares (RSS), we sum the errors across all the points (taking the square to account for negative differences) [@islr, p. 62]:
$$\mbox{RSS} = e^2_1+ e^2_2 +\dots + e^2_n.$$
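To make this concrete, here is a small sketch that computes the RSS by hand for one candidate line using the simulated running data; the guesses of zero and eight for the intercept and slope are illustrative only, and `lm()` will find the values that minimize this quantity.
```{r}
# Compute the RSS for one candidate pair of coefficients (illustrative guesses)
candidate_intercept <- 0
candidate_slope <- 8

sim_run_data |>
  mutate(
    estimated_marathon_time = candidate_intercept + candidate_slope * five_km_time,
    error = marathon_time - estimated_marathon_time
  ) |>
  summarise(rss = sum(error^2))
```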
This results in one linear best-fit line (@fig-fivekmvsmarathon-2), but it is worth reflecting on all the assumptions and decisions that it took to get us to this point.\index{linear regression!best fit}
Underpinning our use of simple linear regression is a belief that there is some "true" relationship between $X$ and $Y$. And that this is a linear function of $X$. We do not, and cannot, know the "true" relationship between $X$ and $Y$. All we can do is use our sample to estimate it. But because our understanding depends on that sample, for every possible sample, we would get a slightly different relationship, as measured by the coefficients.
That $\epsilon$ is a measure of our error---what does the model not know in the small, contained world of the dataset? But it does not tell us whether the model is appropriate outside of the dataset (think, by way of analogy, to the concepts of internal and external validity for experiments introduced in @sec-hunt-data). That requires our judgement and experience.
We can conduct simple linear regression with `lm()` from base R.\index{simple linear regression!base R} We specify the outcome variable first, then `~`, followed by the predictor. The outcome variable is the variable of interest, while the predictor is the basis on which we consider that variable. Finally, we specify the dataset.
Before we run a regression, we may want to include a quick check of the class of the variables, and the number of observations, just to ensure that it corresponds with what we were expecting, although we may have done that earlier in the workflow.\index{simple linear regression!sanity checks} And after we run it, we may check that our estimate seems reasonable. For instance, (pretending that we had not ourselves imposed this in the simulation) based on our knowledge of the respective distances of a five-kilometer run and a marathon, we would expect $\beta_1$ to be somewhere between six and ten.
```{r}
# Check the class and number of observations are as expected
stopifnot(
class(sim_run_data$marathon_time) == "numeric",
class(sim_run_data$five_km_time) == "numeric",
nrow(sim_run_data) == 200
)
sim_run_data_first_model <-
lm(
marathon_time ~ five_km_time,
data = sim_run_data
)
stopifnot(between(
sim_run_data_first_model$coefficients[2],
6,
10
))
```
To quickly see the result of the regression, we can use `summary()`.
```{r}
summary(sim_run_data_first_model)
```
But we can also use `modelsummary()` from `modelsummary` (@tbl-modelsummaryfivekmonly). The advantage of that approach is that we get a nicely formatted table. We focus initially on those results in the "Five km only" column.
```{r}
#| eval: true
#| echo: false
sim_run_data <-
sim_run_data |>
mutate(centered_time = five_km_time - mean(sim_run_data$five_km_time))
sim_run_data_centered_model <-
lm(
marathon_time ~ centered_time,
data = sim_run_data
)
```
```{r}
#| label: tbl-modelsummaryfivekmonly
#| tbl-cap: "Explaining marathon times based on five-kilometer run times"
modelsummary(
list(
"Five km only" = sim_run_data_first_model,
"Five km only, centered" = sim_run_data_centered_model
),
fmt = 2
)
```
The top half of @tbl-modelsummaryfivekmonly provides our estimated coefficients and standard errors in brackets. And the second half provides some useful diagnostics, which we will not dwell on too much in this book. The intercept in the "Five km only" column is the marathon time associated with a hypothetical five-kilometer time of zero minutes. Hopefully this example illustrates the need to always interpret the intercept coefficient carefully. And to disregard it at times. For instance, in this circumstance, we know that the intercept should be zero, and it is just being set to around four because that is the best fit given that all the observations were for five-kilometer times between 15 and 30.
The intercept becomes more interpretable when we run the regression using centered five-kilometer time, which we do in the "Five km only, centered" column of @tbl-modelsummaryfivekmonly.\index{simple linear regression!centered} That is, for each of the five-kilometer times we subtract the mean five-kilometer time. In this case, the intercept is interpreted as the expected marathon time for someone who runs five kilometers in the average time. Notice that the slope estimate is unchanged; it is only the intercept that has changed.
```{r}
#| eval: false
#| echo: true
sim_run_data <-
sim_run_data |>
mutate(centered_time = five_km_time - mean(sim_run_data$five_km_time))
sim_run_data_centered_model <-
lm(
marathon_time ~ centered_time,
data = sim_run_data
)
```
Following @gelmanhillvehtari2020 [p. 84] we recommend considering the coefficients as comparisons, rather than effects.\index{simple linear regression!coefficient interpretation} And to use language that makes it clear these are comparisons, on average, based on one dataset. For instance, we may consider that the coefficient on the five-kilometer run time shows how different individuals compare. When comparing the marathon times of individuals in our dataset whose five-kilometer run time differed by one minute, on average we find their marathon times differ by about eight minutes. This makes sense seeing as a marathon is roughly that many times longer than a five-kilometer run.
We can use `augment()` from `broom` to add the fitted values and residuals to our original dataset. This allows us to plot the residuals (@fig-fivekmvsmarathonresids).
```{r}
sim_run_data <-
augment(
sim_run_data_first_model,
data = sim_run_data
)
```
```{r}
#| eval: true
#| include: true
#| message: false
#| warning: false
#| label: fig-fivekmvsmarathonresids
#| layout-ncol: 2
#| fig-cap: "Residuals from the simple linear regression with simulated data on the time someone takes to run five kilometers and a marathon"
#| fig-subcap: ["Distribution of residuals", "Residuals by five-kilometer time", "Residuals by marathon time", "Comparing the estimated time with the actual time"]
# Plot a)
ggplot(sim_run_data, aes(x = .resid)) +
geom_histogram(binwidth = 1) +
theme_classic() +
labs(y = "Number of occurrences", x = "Residuals")
# Plot b)
ggplot(sim_run_data, aes(x = five_km_time, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dotted", color = "grey") +
theme_classic() +
labs(y = "Residuals", x = "Five-kilometer time (minutes)")
# Plot c)
ggplot(sim_run_data, aes(x = marathon_time, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dotted", color = "grey") +
theme_classic() +
labs(y = "Residuals", x = "Marathon time (minutes)")
# Plot d)
ggplot(sim_run_data, aes(x = marathon_time, y = .fitted)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
theme_classic() +
labs(y = "Estimated marathon time", x = "Actual marathon time")
```
<!-- We want our estimate to be unbiased. When we say our estimate is unbiased, we are trying to say that even though for some sample our estimate might be too high, and for another sample our estimate might be too low, eventually if we have a lot of data then our estimate would be the same as the population. That is, an estimator is unbiased if it does not systematically over- or under-estimate [@islr, p. 65]. -->
We want to try to speak to the "true" relationship, so we need to try to capture how much we think our understanding depends on the sample that we have to analyze. And this is where the standard error comes in.\index{standard error} It guides us, based on many assumptions, about how to think about the estimates of parameters based on the data that we have (@fig-fivekmvsmarathon-3). Part of this is captured by the fact that standard errors are a function of sample size $n$, and as the sample size increases, the standard errors decrease.
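As a small illustration of this, we could re-estimate the model using only a subset of the simulated observations and compare the standard error of the slope with the one from the full dataset; the cut-off of 50 observations is an arbitrary choice for this sketch.
```{r}
# Standard error of the slope using only the first 50 simulated observations
coef(summary(lm(marathon_time ~ five_km_time, data = slice_head(sim_run_data, n = 50))))[
  "five_km_time", "Std. Error"
]

# Standard error of the slope using all 200 observations
coef(summary(sim_run_data_first_model))["five_km_time", "Std. Error"]
```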
The most common way to summarize the range of uncertainty for a coefficient is to transform its standard error into a confidence interval.\index{confidence!interval} These intervals are often misunderstood to represent a statement of probability about a given realization of the coefficients (i.e. $\hat\beta$). In reality, confidence intervals are a statistic whose properties can only be understood "in expectation" (which is equivalent to repeating an experiment many times). A 95 per cent confidence interval is a range, such that there is "approximately a 95 per cent chance that the" range contains the population parameter, which is typically unknown [@islr, p. 66].\index{confidence!interval}
<!-- This does not mean that when you generate a 95 per cent confidence interval, there is a 95 per cent chance the "true" coefficient lies within this range. -->
<!-- Unlike Bayesian statistics, the frequentist paradigm is not able to make probabilistic claims about coefficients during inference. -->
When the coefficient follows a Normal distribution, the lower and upper end of a 95 per cent confidence interval range will be about $\hat{\beta_1} \pm 2 \times \mbox{SE}\left(\hat{\beta_1}\right)$. For instance, in the case of the marathon time example, the lower end is $8.2 - 2 \times 0.3 = 7.6$ and the upper end is $8.2 + 2 \times 0.3 = 8.8$, and the true value (which we only know in this case because we simulated it) is 8.4.
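We can also obtain these intervals directly with `confint()` from base R, rather than the approximate plus-or-minus two standard errors calculation above; the results will be similar, although `confint()` uses the t-distribution.
```{r}
# 95 per cent confidence intervals for the intercept and slope
confint(sim_run_data_first_model, level = 0.95)
```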
We can use this machinery to test claims.\index{simple linear regression!hypothesis testing} For instance, we could claim that there is no relationship between $X$ and $Y$, i.e. $\beta_1 = 0$, as an alternative to a claim that there is some relationship between $X$ and $Y$, i.e. $\beta_1 \neq 0$. This is the null hypothesis testing approach mentioned earlier. In @sec-hunt-data we needed to decide how much evidence it would take to convince us that our tea taster could distinguish whether milk or tea had been added first. In the same way, here we need to decide if the estimate of $\beta_1$, which we denote as $\hat{\beta}_1$, is far enough away from zero for us to be comfortable claiming that $\beta_1 \neq 0$. If we were very confident in our estimate of $\beta_1$ then it would not have to be far, but if we were not, then it would have to be substantial. For instance, if the confidence interval contains zero, then we lack evidence to suggest $\beta_1 \neq 0$.\index{confidence!interval} The standard error of $\hat{\beta}_1$ does an awful lot of work here in accounting for a variety of factors, only some of which it can actually account for, as does our choice as to what it would take to convince us.
We use this standard error and $\hat{\beta}_1$ to get the "test statistic" or t-statistic:
$$t = \frac{\hat{\beta}_1}{\mbox{SE}(\hat{\beta}_1)}.$$
And we then compare our t-statistic to the t-distribution of @student1908probable to compute the probability of getting this absolute t-statistic or a larger one, if it was the case that $\beta_1 = 0$. This probability is the p-value. A smaller p-value means there is a smaller "probability of observing something at least as extreme as the observed test statistic" [@gelmanhillvehtari2020, p. 57]. Here we use the t-distribution of @student1908probable rather than the Normal distribution, because the t-distribution has slightly larger tails than a standard Normal distribution.
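As a sketch of what `summary()` reports, we can recover the t-statistic and the p-value for the slope by hand from the fitted model; the degrees of freedom are the number of observations less the two estimated coefficients.
```{r}
# Recover the t-statistic and two-sided p-value for the slope by hand
slope_estimate <-
  coef(summary(sim_run_data_first_model))["five_km_time", "Estimate"]
slope_std_error <-
  coef(summary(sim_run_data_first_model))["five_km_time", "Std. Error"]

t_statistic <- slope_estimate / slope_std_error

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom
2 * pt(abs(t_statistic), df = nrow(sim_run_data) - 2, lower.tail = FALSE)
```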
> Words! Mere words! How terrible they were! How clear, and vivid, and cruel! One could not escape from them. And yet what a subtle magic there was in them! They seemed to be able to give a plastic form to formless things, and to have a music of their own as sweet as that of viol or of lute. Mere words! Was there anything so real as words?
>
> *The Picture of Dorian Gray* [@wilde].
We will not make much use of p-values\index{simple linear regression!p-values}\index{p-values} in this book because they are a specific and subtle concept. They are difficult to understand and easy to abuse. Even though they are "little help" for "scientific inference", many disciplines are incorrectly fixated on them [@nelderdoesntmiss, p. 257].\index{p-values!incorrect fixation} One issue is that they embody every assumption of the workflow, including everything that went into gathering and cleaning the data. While p-values have implications if all the assumptions were correct, when we consider the full data science workflow there are usually an awful lot of assumptions.\index{p-values!difficulty of verifying} And we do not get guidance from p-values about whether the assumptions are satisfied [@greenland2016statistical, p. 339].
A p-value may reject a null hypothesis because the null hypothesis is false, but it may also be that some data were incorrectly gathered or prepared. We can only be sure that the p-value speaks to the hypothesis we are interested in testing if all the other assumptions are correct.\index{p-values!data quality} There is nothing wrong about using p-values, but it is important to use them in sophisticated and thoughtful ways. @coxtalks discusses what this requires.
One application where it is easy to see an inappropriate focus on p-values is in power analysis.\index{p-values!power} Power, in a statistical sense, refers to the probability of rejecting a null hypothesis that is false. As power relates to hypothesis testing, it also relates to sample size. There is often a worry that a study is "under-powered", meaning there was not a large enough sample, but rarely a worry that, say, the data were inappropriately cleaned, even though we cannot distinguish between these based only on a p-value. As @meng2018statistical and @Bradley2021 demonstrate, a focus on power can blind us from our responsibility to ensure our data are high quality.
:::{.callout-note}
## Shoulders of giants
Dr Nancy Reid\index{Reid, Nancy} is University Professor in the Department of Statistical Sciences at the University of Toronto.\index{statistics}\index{University of Toronto} After obtaining a PhD in Statistics from Stanford University in 1979, she took a position as a postdoctoral fellow at Imperial College London. She was appointed as an assistant professor at the University of British Columbia in 1980, and then moved to the University of Toronto in 1986, where she was promoted to full professor in 1988 and served as department chair between 1997 and 2002 [@Staicu2017]. Her research focuses on obtaining accurate inference in small-sample regimes and developing inferential procedures for complex models featuring intractable likelihoods. @cox1987parameter examine how re-parameterizing models can simplify inference, @varin2011overview surveys methods for approximating intractable likelihoods, and @reid2003asymptotics overviews inferential procedures in the small-sample regime. Dr Reid was awarded the 1992 COPSS Presidents' Award,\index{COPSS Presidents' Award} the Royal Statistical Society Guy Medal\index{Guy Medal!Silver} in Silver in 2016 and in Gold in 2022,\index{Guy Medal!Gold} and the COPSS Distinguished Achievement Award and Lectureship in 2022.
:::
## Multiple linear regression
To this point we have just considered one explanatory variable. But we will usually have more than one. One approach would be to run a separate regression for each explanatory variable. But, compared with separate regressions, including the explanatory variables together allows the association between the outcome variable and the predictor of interest to be assessed while adjusting for the other explanatory variables. The results can be quite different, especially when the explanatory variables are correlated with each other.\index{linear regression!multiple}
We may also like to consider explanatory variables that are not continuous. For instance: pregnant or not; day or night. When there are only two options then we can use a binary variable, which is considered either 0 or 1. If we have a column of character values that only has two values, such as: `c("Myles", "Ruth", "Ruth", "Myles", "Myles", "Ruth")`, then using this as an explanatory variable in the usual regression set-up would mean that it is treated as a binary variable. If there are more than two levels, then we can use a combination of binary variables, where some baseline outcome gets integrated into the intercept.\index{binary variables!predictors}
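As a brief sketch of what happens behind the scenes, we can use `model.matrix()` to see how a two-level character variable, such as the names above, is converted into a 0/1 indicator, with one level absorbed into the intercept; the column name `first_name` is just for illustration.
```{r}
# How a two-level character variable is encoded as a binary indicator
model.matrix(
  ~ first_name,
  data = tibble(first_name = c("Myles", "Ruth", "Ruth", "Myles", "Myles", "Ruth"))
)
```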
### Simulated example: running times with rain and humidity
As an example, we add whether it was raining to our simulated relationship between marathon and five-kilometer run times. We then specify that if it was raining, the individual is ten minutes slower than they otherwise would have been.
```{r}
slow_in_rain <- 10
sim_run_data <-
sim_run_data |>
mutate(was_raining = sample(
c("Yes", "No"),
size = num_observations,
replace = TRUE,
prob = c(0.2, 0.8)
)) |>
mutate(
marathon_time = if_else(
was_raining == "Yes",
marathon_time + slow_in_rain,
marathon_time
)
) |>
select(five_km_time, marathon_time, was_raining)
```
We can add additional explanatory variables to `lm()` with `+`. Again, we will include a variety of quick tests for class and the number of observations, and add another about missing values. We may not have any idea what the coefficient of rain should be, but if we did not expect it to make them faster, then we could also add a test of that with a wide interval of non-negative values.
```{r}
stopifnot(
class(sim_run_data$marathon_time) == "numeric",
class(sim_run_data$five_km_time) == "numeric",
class(sim_run_data$was_raining) == "character",
all(complete.cases(sim_run_data)),
nrow(sim_run_data) == 200
)
sim_run_data_rain_model <-
lm(
marathon_time ~ five_km_time + was_raining,
data = sim_run_data
)
stopifnot(
between(sim_run_data_rain_model$coefficients[3], 0, 20)
)
summary(sim_run_data_rain_model)
```
The result, in the second column of @tbl-modelsummaryruntimes, shows that when we compare individuals in our dataset who ran in the rain with those who did not, the ones in the rain tended to have a slower time. And this corresponds with what we expect if we look at a plot of the data (@fig-fivekmvsmarathonbinary-1).
We have included two types of tests here. The ones run before `lm()` check inputs, and the ones run after `lm()` check outputs. We may notice that some of the input checks are the same as earlier. One way to avoid having to rewrite tests many times would be to install and load `testthat` to create a suite of class tests in, say, an R file called "test_class.R", which are then called using `test_file()`.
For instance, we could save the following as "test_class.R" in a dedicated tests folder.
```{r}
#| eval: false
test_that("Check class", {
expect_type(sim_run_data$marathon_time, "double")
expect_type(sim_run_data$five_km_time, "double")
expect_type(sim_run_data$was_raining, "character")
})
```
We could save the following as "test_observations.R".
```{r}
#| eval: false
test_that("Check number of observations is correct", {
expect_equal(nrow(sim_run_data), 200)
})
test_that("Check complete", {
expect_true(all(complete.cases(sim_run_data)))
})
```
And finally, we could save the following as "test_coefficient_estimates.R".
```{r}
#| eval: false
test_that("Check coefficients", {
expect_gt(sim_run_data_rain_model$coefficients[3], 0)
expect_lt(sim_run_data_rain_model$coefficients[3], 20)
})
```
We could then change the regression code to call these test files rather than write them all out.
```{r}
#| message: false
#| eval: false
test_file("tests/test_observations.R")
test_file("tests/test_class.R")
sim_run_data_rain_model <-
lm(
marathon_time ~ five_km_time + was_raining,
data = sim_run_data
)
test_file("tests/test_coefficient_estimates.R")
```
It is important to be clear about what we are looking for in the checks of the coefficients. When we simulate data, we put in place reasonable guesses for what the data could look like, and it is similarly reasonable guesses that we test. A failing test is not necessarily a reason to go back and change things, but instead a reminder to look at what is going on in both the data and the test, and potentially update the test if necessary.
In addition to wanting to include additional explanatory variables, we may think that they are related to each other. For instance, maybe rain really matters if it is also humid that day. We are interested in rain and humidity individually, but also in how those two variables interact (@fig-fivekmvsmarathonbinary-2).\index{linear regression!variable interaction} We can do this by using `*` instead of `+` when we specify the model. When we interact variables in this way, then we almost always need to include the individual variables as well, and `lm()` will do this by default. The result is contained in the third column of @tbl-modelsummaryruntimes.
```{r}
slow_in_humidity <- 15
sim_run_data <- sim_run_data |>
mutate(
humidity = sample(c("High", "Low"), size = num_observations,
replace = TRUE, prob = c(0.2, 0.8)),
marathon_time =
marathon_time + if_else(humidity == "High", slow_in_humidity, 0),
weather_conditions = case_when(
was_raining == "No" & humidity == "Low" ~ "No rain, not humid",
was_raining == "Yes" & humidity == "Low" ~ "Rain, not humid",
was_raining == "No" & humidity == "High" ~ "No rain, humid",
was_raining == "Yes" & humidity == "High" ~ "Rain, humid"
)
)
```
```{r}
#| eval: false
#| echo: false
arrow::write_parquet(x = sim_run_data,
sink = "outputs/data/running_data.parquet")
```
```{r}
#| eval: true
#| fig-cap: "Simple linear regression with simulated data on the time someone takes to run five kilometers and a marathon, depending on the weather"
#| include: true
#| label: fig-fivekmvsmarathonbinary
#| message: false
#| warning: false
#| fig-subcap: ["Only whether it was raining", "Whether it was raining and the level of humidity"]
#| layout-nrow: 2
base <-
sim_run_data |>
ggplot(aes(x = five_km_time, y = marathon_time)) +
labs(
x = "Five-kilometer time (minutes)",
y = "Marathon time (minutes)"
) +
theme_classic() +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom")
base +
geom_point(aes(color = was_raining)) +
geom_smooth(
aes(color = was_raining),
method = "lm",
alpha = 0.3,
linetype = "dashed",
formula = "y ~ x"
) +
labs(color = "Was raining")
base +
geom_point(aes(color = weather_conditions)) +
geom_smooth(
aes(color = weather_conditions),
method = "lm",
alpha = 0.3,
linetype = "dashed",
formula = "y ~ x"
) +
labs(color = "Conditions")
```
```{r}
#| include: true
#| label: tbl-modelsummaryruntimes
#| tbl-cap: "Explaining marathon times based on five-kilometer run times and weather features"
sim_run_data_rain_and_humidity_model <-
lm(
marathon_time ~ five_km_time + was_raining * humidity,
data = sim_run_data
)
modelsummary(
list(
"Five km only" = sim_run_data_first_model,
"Add rain" = sim_run_data_rain_model,
"Add humidity" = sim_run_data_rain_and_humidity_model
),
fmt = 2
)
```
There are a variety of threats to the validity of linear regression estimates, and aspects to think about, particularly when using an unfamiliar dataset. We need to address these when we use it, and usually graphs and associated text are sufficient to assuage most of them. Aspects of concern include:\index{linear regression!threats to validity}
1. Linearity of explanatory variables. We are concerned with whether the predictors enter in a linear way. We can usually be convinced there is enough linearity in our explanatory variables for our purposes by using graphs of the variables.
2. Homoscedasticity of errors. We are concerned that the errors are not becoming systematically larger or smaller throughout the sample. If that is happening, then we term it heteroscedasticity. Again, graphs of errors, such as @fig-fivekmvsmarathonresids-2, are used to convince us of this.
3. Independence of errors. We are concerned that the errors are not correlated with each other. For instance, if we are interested in weather-related measurement such as average daily temperature, then we may find a pattern because the temperature on one day is likely similar to the temperature on another. We can be convinced that we have satisfied this condition by looking at the residuals compared with observed values, such as @fig-fivekmvsmarathonresids-3, or estimates compared with actual outcomes, such as @fig-fivekmvsmarathonresids-4.
4. Outliers and other high-impact observations. Finally, we might be worried that our results are being driven by a handful of observations. For instance, thinking back to @sec-static-communication and Anscombe's Quartet, we notice that linear regression estimates would be heavily influenced by the inclusion of one or two particular points. We can become comfortable with this by considering our analysis on various subsets, for instance by randomly removing some observations in the way we did to the US states in @sec-exploratory-data-analysis, as sketched after this list.
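The following is a minimal sketch of that last check, using the simulated running data: we re-estimate the slope on repeated random subsets and look at how much it moves. The choice of 100 repetitions and 90 per cent subsamples is arbitrary and purely illustrative.
```{r}
# Re-estimate the slope on 100 random 90 per cent subsamples of the data
set.seed(853)

subsample_slopes <-
  map_dbl(
    1:100,
    ~ lm(
      marathon_time ~ five_km_time,
      data = slice_sample(sim_run_data, prop = 0.9)
    )$coefficients[2]
  )

# If a handful of observations were driving the results, these would vary widely
quantile(subsample_slopes, probs = c(0.05, 0.5, 0.95))
```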
Those aspects are statistical concerns and relate to whether the model is working. The most important threat to validity, and hence the aspect that must be addressed at some length, is whether this model is directly relevant to the research question of interest.
:::{.callout-note}
## Shoulders of giants
Dr Daniela Witten is the Dorothy Gilford Endowed Chair of Mathematical Statistics and Professor of Statistics & Biostatistics at the University of Washington.\index{statistics}\index{biostatistics} After taking a PhD in Statistics from Stanford University in 2010, she joined the University of Washington as an assistant professor. She was promoted to full professor in 2018. One active area of her research is double-dipping which is focused on the effect of sample splitting [@selectiveinference]. She is an author of the influential *Introduction to Statistical Learning* [@islr]. Witten was appointed a Fellow of the American Statistical Association in 2020 and awarded the COPSS Presidents' Award in 2022.\index{COPSS Presidents' Award}
:::
## Building models {#sec-inferencewithbayesianmethods}
@breiman2001statistical describes two cultures of statistical modeling: one focused on inference and the other on prediction.\index{model!building} In general, around the time @breiman2001statistical was published, various disciplines tended to focus on either inference or prediction. For instance, @Jordan2004 describes how statistics\index{statistics} and computer science\index{computer science} had been separate for some time, but how the goals of each field were becoming more closely aligned. The rise of data science, and in particular machine learning has meant there is now a need to be comfortable with both [@Neufeld2021].\index{data science!culture} The two cultures are being brought closer together, and there is an overlap and interaction between prediction and inference.\index{model!building} But their separate evolution means there are still considerable cultural differences. As a small example of this, the term "machine learning" tends to be used in computer science, while the term "statistical learning" tends to be used in statistics, even though they usually refer to the same machinery.
::: {.content-visible when-format="pdf"}
In this book we will focus on inference using the probabilistic programming language Stan\index{Stan} to fit models in a Bayesian\index{Bayesian!modeling} framework, and interface with it using `rstanarm`. Inference and forecasting have different cultures, ecosystems, and priorities. You should try to develop a comfort in both. One way these different cultures manifest is in language choice. The primary language in this book is R, and for the sake of consistency we focus on that here. But there is an extensive culture, especially but not exclusively focused on prediction, that uses Python. We recommend focusing on only one language and approach initially, but after developing that initial familiarity it is important to become multilingual. We introduce prediction based on Python in the ["Prediction" Online Appendix](https://tellingstorieswithdata.com/27-prediction.html).
:::
::: {.content-visible unless-format="pdf"}
In this book we will focus on inference using the probabilistic programming language Stan to fit models in a Bayesian\index{Bayesian!modeling} framework, and interface with it using `rstanarm`. Inference and forecasting have different cultures, ecosystems, and priorities. You should try to develop a comfort in both. One way these different cultures manifest is in language choice. The primary language in this book is R, and for the sake of consistency we focus on that here. But there is an extensive culture, especially but not exclusively focused on prediction, that uses Python. We recommend focusing on only one language and approach initially, but after developing that initial familiarity it is important to become multilingual. We introduce prediction based on Python in [Online Appendix -@sec-predictingpythons].
:::
We will again not go into the details, but running a regression in a Bayesian setting is similar to the frequentist approach that underpins `lm()`.\index{Bayesian!modeling}\index{Bayesian!linear regression} The main difference from a regression point of view is that the parameters involved in the models (i.e. $\beta_0$, $\beta_1$, etc) are considered random variables, and so themselves have associated probability distributions. In contrast, the frequentist paradigm assumes that any randomness from these coefficients comes from a parametric assumption on the distribution of the error term, $\epsilon$.
Before we run a regression in a Bayesian framework, we need to decide on a starting probability distribution for each of these parameters, which we call a "prior".\index{Bayesian!priors} While the presence of priors adds some additional complication, it has several advantages, and we will discuss the issue of priors in more detail below. This is another reason for the workflow advocated in this book: the simulation stage leads directly to priors. We will again specify the model that we are interested in, but this time we include priors.
$$
\begin{aligned}
y_i|\mu_i, \sigma &\sim \mbox{Normal}(\mu_i, \sigma) \\
\mu_i &= \beta_0 +\beta_1x_i\\
\beta_0 &\sim \mbox{Normal}(0, 2.5) \\
\beta_1 &\sim \mbox{Normal}(0, 2.5) \\
\sigma &\sim \mbox{Exponential}(1) \\
\end{aligned}
$$
We combine information from the data with the priors to obtain posterior distributions for our parameters. Inference is then carried out based on analyzing posterior distributions.\index{Bayesian!posterior}
Another aspect that is different between Bayesian approaches and the way we have been doing modeling to this point, is that Bayesian models will usually take longer to run. Because of this, it can be useful to run the model in a separate R script and then save it with `saveRDS()`.\index{Bayesian!modeling} With sensible Quarto chunk options for "eval" and "echo" (see @sec-reproducible-workflows), the model can then be read into the Quarto document with `readRDS()` rather than being run every time the paper is compiled. In this way, the model delay is only imposed once for a given model. It can also be useful to add `beep()` from `beepr` to the end of the model, to get an audio notification when the model is done.
```{r}
#| echo: true
#| eval: false
#| message: false
#| warning: false
sim_run_data_first_model_rstanarm <-
stan_glm(
formula = marathon_time ~ five_km_time + was_raining,
data = sim_run_data,
family = gaussian(),
prior = normal(location = 0, scale = 2.5),
prior_intercept = normal(location = 0, scale = 2.5),
prior_aux = exponential(rate = 1),
seed = 853
)
beep()
saveRDS(
sim_run_data_first_model_rstanarm,
file = "sim_run_data_first_model_rstanarm.rds"
)
```
```{r}
#| echo: false
#| eval: false
#| message: false
#| warning: false
# INTERNAL
saveRDS(
sim_run_data_first_model_rstanarm,
file = "outputs/model/sim_run_data_first_model_rstanarm.rds"
)
```
```{r}
#| echo: true
#| eval: false
#| warning: false
#| message: false
sim_run_data_first_model_rstanarm <-
readRDS(file = "sim_run_data_first_model_rstanarm.rds")
```
```{r}
#| echo: false
#| eval: true
#| warning: false
#| message: false
sim_run_data_first_model_rstanarm <-
readRDS(file = "outputs/model/sim_run_data_first_model_rstanarm.rds")
```
We use `stan_glm()` with the `gaussian()` family to specify multiple linear regression, and the model formula is written in the same way as with `lm()` in base R.\index{linear regression!Bayesian}\index{Bayesian!linear regression} We have explicitly added the default priors, as we see this as good practice, although strictly this is not necessary.
The estimation results, which are in the first column of @tbl-modelsummarybayesbetter, are not quite what we expect. For instance, the rate of increase in marathon times is estimated to be around three minutes per minute of increase in five-kilometer time, which seems low given the ratio of a five-kilometer run to marathon distance.
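One quick way to inspect those estimates is with `tidy()` from `broom.mixed`, which supports `rstanarm` model objects; this is just a quick look rather than a full summary of the posterior.
```{r}
# A tidy summary of the estimated coefficients from the rstanarm model
tidy(sim_run_data_first_model_rstanarm)
```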
### Picking priors
The issue of picking priors is a challenging one and the subject of extensive research. For the purposes of this book, using the `rstanarm` defaults is fine.\index{Bayesian!priors} But even if they are just the default, priors should be explicitly specified in the model and included in the function.\index{model!specification} This is to make it clear to others what has been done. We could use `default_prior_intercept()` and `default_prior_coef()` to find the default priors in `rstanarm` and then explicitly include them in the model.
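As a hedged sketch, and assuming these helper functions accept a family object as in recent versions of `rstanarm`, that check might look like the following.
```{r}
#| eval: false
# Inspect the rstanarm default priors for the intercept and the coefficients
default_prior_intercept(gaussian())
default_prior_coef(gaussian())
```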
It is normal to find it difficult to know what prior to specify. Getting started by adapting someone else's `rstanarm` code is perfectly fine. If they have not specified their priors, then we can use the helper function `prior_summary()`, to find out which priors were used.\index{Bayesian!priors}
```{r}
prior_summary(sim_run_data_first_model_rstanarm)
```
We are interested in understanding what the priors imply before we involve any data. We do this by implementing prior predictive checks.\index{Bayesian!prior predictive check} This means simulating from the priors to look at what the model implies about the possible magnitude and direction of the relationships between the explanatory and outcome variables. This process is no different to all the other simulation that we have done to this point.\index{distribution!Normal}\index{distribution!exponential}
```{r}
draws <- 1000
priors <-
tibble(
sigma = rep(rexp(n = draws, rate = 1), times = 16),
beta_0 = rep(rnorm(n = draws, mean = 0, sd = 2.5), times = 16),
beta_1 = rep(rnorm(n = draws, mean = 0, sd = 2.5), times = 16),
five_km_time = rep(15:30, each = draws),
mu = beta_0 + beta_1 * five_km_time
) |>
rowwise() |>
mutate(
marathon_time = rnorm(n = 1, mean = mu, sd = sigma)
)
```
```{r}
#| echo: true
#| eval: true
#| message: false
#| warning: false
#| label: fig-worstmodelever
#| fig-cap: "Some implications from the priors that were used"
#| layout-ncol: 2
#| fig-subcap: ["Distribution of implied marathon times", "Relationship between 5km and marathon times"]
priors |>
ggplot(aes(x = marathon_time)) +
geom_histogram(binwidth = 10) +
theme_classic()
priors |>
ggplot(aes(x = five_km_time, y = marathon_time)) +
geom_point(alpha = 0.1) +
theme_classic()
```
@fig-worstmodelever suggests our model has been poorly constructed. Not only are there world record marathon times, but there are also negative marathon times! One issue is that our prior for $\beta_1$ does not incorporate all the information that we have. We know that a marathon is about eight times longer than a five-kilometer run, and so we could center the prior for $\beta_1$ around that.\index{Bayesian!priors} Our re-specified model is:
$$
\begin{aligned}
y_i|\mu_i, \sigma &\sim \mbox{Normal}(\mu_i, \sigma) \\
\mu_i &= \beta_0 +\beta_1x_i\\
\beta_0 &\sim \mbox{Normal}(0, 2.5) \\
\beta_1 &\sim \mbox{Normal}(8, 2.5) \\
\sigma &\sim \mbox{Exponential}(1) \\
\end{aligned}
$$
And we can see from prior predictive checks that this re-specified model seems more reasonable (@fig-secondworstmodelever).\index{distribution!Normal}\index{distribution!exponential}
```{r}
draws <- 1000
updated_priors <-
tibble(
sigma = rep(rexp(n = draws, rate = 1), times = 16),
beta_0 = rep(rnorm(n = draws, mean = 0, sd = 2.5), times = 16),
beta_1 = rep(rnorm(n = draws, mean = 8, sd = 2.5), times = 16),
five_km_time = rep(15:30, each = draws),
mu = beta_0 + beta_1 * five_km_time
) |>
rowwise() |>
mutate(
marathon_time = rnorm(n = 1, mean = mu, sd = sigma)
)
```
```{r}
#| echo: true
#| eval: true
#| message: false
#| warning: false
#| fig-cap: "Updated priors"
#| label: fig-secondworstmodelever
#| layout-ncol: 2
#| fig-subcap: ["Distribution of implied marathon times", "Relationship between 5km and marathon times"]
updated_priors |>
ggplot(aes(x = marathon_time)) +
geom_histogram(binwidth = 10) +
theme_classic()
updated_priors |>
ggplot(aes(x = five_km_time, y = marathon_time)) +
geom_point(alpha = 0.1) +
theme_classic()
```
If we were not sure what to do, then `rstanarm` can help to improve the specified priors by scaling them based on the data.\index{Bayesian!priors scaling} Specify the prior that you think is reasonable, even if this is just the default, and include it in the function, but also include "autoscale = TRUE", and `rstanarm` will adjust the scale. When we re-run our model with these updated priors and allow auto-scaling, we get much better results, which are in the second column of @tbl-modelsummarybayesbetter. You can then add the re-scaled priors to the written-out model.
```{r}
#| include: true
#| message: false
#| warning: false
#| eval: false
sim_run_data_second_model_rstanarm <-
stan_glm(
formula = marathon_time ~ five_km_time + was_raining,
data = sim_run_data,
family = gaussian(),
prior = normal(location = 8, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(location = 0, scale = 2.5, autoscale = TRUE),
prior_aux = exponential(rate = 1, autoscale = TRUE),
seed = 853
)
saveRDS(
sim_run_data_second_model_rstanarm,
file = "sim_run_data_second_model_rstanarm.rds"
)
```
```{r}
#| include: false
#| message: false
#| warning: false
#| eval: false
# INTERNAL
saveRDS(
sim_run_data_second_model_rstanarm,
file = "outputs/model/sim_run_data_second_model_rstanarm.rds"
)
```
```{r}
#| eval: true
#| include: false
#| warning: false
#| message: false
sim_run_data_second_model_rstanarm <-
readRDS(file = "outputs/model/sim_run_data_second_model_rstanarm.rds")
```
```{r}
#| label: tbl-modelsummarybayesbetter
#| tbl-cap: "Forecasting and explanatory models of marathon times based on five-kilometer run times"
#| warning: false
modelsummary(
list(
"Non-scaled priors" = sim_run_data_first_model_rstanarm,
"Auto-scaling priors" = sim_run_data_second_model_rstanarm
),
fmt = 2
)
```
As we used the "autoscale = TRUE" option, it can be helpful to look at how the priors were updated with `prior_summary()` from `rstanarm`.
```{r}
prior_summary(sim_run_data_second_model_rstanarm)
```
### Posterior distribution
Having built a Bayesian model, we may want to look at what it implies (@fig-ppcheckandposteriorvsprior). One way to do this is to consider the posterior distribution.\index{Bayesian!posterior}
One way to use the posterior distribution is to consider whether the model is doing a good job of fitting the data. The idea is that if the model fits the data well, then we should be able to use the posterior to simulate data that resemble the actual data [@bayesianworkflow]. We can implement a posterior predictive check\index{Bayesian!posterior predictive check} with `pp_check()` from `rstanarm` (@fig-ppcheckandposteriorvsprior-1), which compares the actual outcome variable with simulations from the posterior distribution. And we can compare the posterior with the prior using `posterior_vs_prior()` to see how much the estimates change once data are taken into account (@fig-ppcheckandposteriorvsprior-2). Helpfully, `pp_check()` and `posterior_vs_prior()` return `ggplot2` objects, so we can modify their appearance in the same way that we manipulate other graphs. These checks and the accompanying discussion would typically be only briefly mentioned in the main content of a paper, with the detail and graphs in a dedicated appendix.
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
#| label: fig-ppcheckandposteriorvsprior
#| layout-ncol: 2
#| fig-cap: "Examining how the model fits, and is affected by, the data"
#| fig-subcap: ["Posterior prediction check", "Comparing the posterior with the prior"]
pp_check(sim_run_data_second_model_rstanarm) +
theme_classic() +
theme(legend.position = "bottom")
posterior_vs_prior(sim_run_data_second_model_rstanarm) +
theme_minimal() +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom") +
coord_flip()
```
With a simple model like this, the differences between the prediction and inference approaches are minimal. But as model or data complexity increases these differences can become important.
We have already discussed confidence intervals.\index{confidence!interval} The Bayesian equivalent is called a "credibility interval", and it reflects two points between which a given amount of probability mass lies, in this case 95 per cent.\index{Bayesian!credibility interval}
<!-- @tbl-modelsummarybayesbetter shows confidence intervals for `tidymodels` and credible intervals for `rstanarm`, beneath the estimates. -->
Bayesian estimation provides a distribution for each coefficient. This means there is an infinite number of points that we could use to generate this interval.
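If particular interval endpoints are nonetheless needed, say for a table, then one option is `posterior_interval()` from `rstanarm`, which summarizes the posterior draws for each coefficient. This is a brief sketch; the 95 per cent level is chosen here for illustration rather than being a default.
```{r}
#| echo: true
#| eval: false
# Sketch: 95 per cent credibility intervals for each coefficient,
# computed from the posterior draws of the auto-scaled model
posterior_interval(
  sim_run_data_second_model_rstanarm,
  prob = 0.95
)
```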
<!-- We will include standard errors credible interval points in tables, because this is commonly needed. -->
The entire distribution should be shown graphically (@fig-credibleintervals), perhaps in a cross-referenced appendix.
```{r}
#| label: fig-credibleintervals
#| fig-cap: "Credible intervals"
plot(
sim_run_data_second_model_rstanarm,
"areas"
)
```
### Diagnostics
The final aspect that we would like to check is a practical matter. `rstanarm` uses a sampling algorithm called Markov chain Monte Carlo (MCMC) to obtain samples from the posterior distributions of interest. We need to quickly check for signs that the algorithm ran into issues. We consider a trace plot, such as @fig-stanareyouokay-1, and an Rhat plot, such as @fig-stanareyouokay-2. These would typically go in a cross-referenced appendix.\index{Bayesian!trace plot}\index{Bayesian!Rhat plot}
```{r}
#| echo: true
#| eval: true
#| message: false
#| warning: false
#| label: fig-stanareyouokay
#| fig-cap: "Checking the convergence of the MCMC algorithm"
#| fig-subcap: ["Trace plot", "Rhat plot"]
#| layout-ncol: 2