# Probability & Simulation {#sim}
## Learning Objectives
### Basic
1. Understand what types of data are best modeled by different distributions
+ uniform
+ binomial
+ normal
+ Poisson
2. Generate and plot data randomly sampled from the above distributions
3. Test sampled distributions against a null hypothesis
+ exact binomial test
+ t-test (1-sample, independent samples, paired samples)
+ correlation (Pearson, Kendall and Spearman)
4. Define the following statistical terms:
+ p-value
+ alpha
+ power
+ false positive (type I error)
+ false negative (type II error)
+ confidence interval
### Intermediate
5. Create a function to generate a sample with specific properties and run an inferential test
6. Calculate power using `replicate` and a sampling function
7. Calculate the minimum sample size for a specific power level and design
### Advanced
8. Generate 3+ variables from a multivariate normal distribution and plot them
## Resources
* [Chapter 21: Iteration](http://r4ds.had.co.nz/iteration.html) of *R for Data Science*
* [Improving your statistical inferences](https://www.coursera.org/learn/statistical-inferences/) on Coursera (week 1)
## Distributions
Simulating data is a very powerful way to test your understanding of statistical
concepts. We are going to use simulations to learn the basics of probability.
```{r libraries, results = 'hide', warning = FALSE, message = FALSE}
# libraries needed for these examples
library(tidyverse)
library(MASS)
```
```{r set-seed, echo = FALSE}
set.seed(8675309)
```
### Uniform Distribution
The uniform distribution is the simplest distribution. All numbers in the range
have an equal probability of being sampled.
#### Sample continuous distribution
`runif(n, min=0, max=1)`
Use `runif()` to sample from a continuous uniform distribution.
```{r runif}
runif(10, min = 0, max = 1)
```
#### Sample discrete distribution
`sample(x, size, replace = FALSE, prob = NULL)`
Use `sample()` to sample from a discrete distribution.
Simulate a single roll of a 6-sided die.
```{r sample}
sample(6, 1)
```
Simulate 10 rolls of a 6-sided die. Set `replace` to `TRUE` so each roll is
independent. See what happens if you set `replace` to `FALSE`.
```{r sample-replace}
sample(6, 10, replace = TRUE)
```
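For comparison, here is a minimal sketch of sampling without replacement: each face can only be drawn once, so asking for more samples than there are faces produces an error (the `error = TRUE` chunk option lets the document knit anyway).
```{r sample-no-replace, error = TRUE}
sample(6, 6, replace = FALSE)  # a permutation of 1:6; each face appears exactly once
sample(6, 10, replace = FALSE) # error: cannot take a sample larger than the population
```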
You can also use `sample()` to sample from a list of named outcomes.
```{r sample-list}
pet_types <- c("cat", "dog", "ferret", "bird", "fish")
sample(pet_types, 10, replace = TRUE)
```
Ferrets are a much less common pet than cats and dogs, so our sample isn't very
realistic. You can set the probabilities of each item in the list with the `prob`
argument.
```{r sample-prob}
pet_types <- c("cat", "dog", "ferret", "bird", "fish")
pet_prob <- c(0.3, 0.4, 0.1, 0.1, 0.1)
sample(pet_types, 10, replace = TRUE, prob = pet_prob)
```
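With a large enough sample, the observed proportions should approximate the probabilities you set in `prob`. Here is a quick sketch of that check.
```{r check-pet-prob}
# sample many pets and compare the observed proportions to pet_prob
big_sample <- sample(pet_types, 1e4, replace = TRUE, prob = pet_prob)
round(table(big_sample) / 1e4, 2)
```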
### Binomial Distribution
The binomial distribution is useful for modeling binary data, where each
observation can have one of two outcomes, like success/failure, yes/no or
heads/tails.
#### Sample distribution
`rbinom(n, size, prob)`
The `rbinom` function will generate a random binomial distribution.
* `n` = number of observations
* `size` = number of trials
* `prob` = probability of success on each trial
Coin flips are a typical example of a binomial distribution, where we can assign
heads to 1 and tails to 0.
20 individual coin flips of a fair coin
```{r rbinom-fair}
rbinom(20, 1, 0.5)
```
20 individual coin flips of a biased (0.75) coin
```{r rbinom-bias}
rbinom(20, 1, 0.75)
```
You can generate the total number of heads in 1 set of 20 coin flips by setting
`size` to 20 and `n` to 1.
```{r rbinom-size}
rbinom(1, 20, 0.75)
```
You can generate more sets of 20 coin flips by increasing the `n`.
```{r rbinom-n}
rbinom(10, 20, 0.5)
```
You should always check your randomly generated data to make sure it makes sense.
For large samples, it's easiest to do this graphically. A histogram is usually
the best choice for plotting binomial data.
```{r sim_flips}
flips <- rbinom(1000, 20, 0.5)
ggplot() +
geom_histogram(
aes(flips),
binwidth = 1,
fill = "white",
color = "black"
)
```
```{block, type="info"}
Run the simulation above several times, noting how the histogram changes. Try changing the values of `n`, `size`, and `prob`.
```
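One quick numeric check (a sketch) is to compare the sample mean and SD of `flips` to the theoretical binomial values, `size * prob` and `sqrt(size * prob * (1 - prob))`.
```{r check-flips}
mean(flips) # should be close to 20 * 0.5 = 10
sd(flips)   # should be close to sqrt(20 * 0.5 * 0.5), about 2.24
```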
#### Exact binomial test
`binom.test(x, n, p)`
You can test a binomial distribution against a specific probability using the
exact binomial test.
* `x` = the number of successes
* `n` = the number of trials
* `p` = hypothesised probability of success
Here we can test a series of 10 coin flips from a fair coin and a biased coin
against the hypothesised probability of 0.5 (even odds).
```{r binom-test}
n <- 10
fair_coin <- rbinom(1, n, 0.5)
biased_coin <- rbinom(1, n, 0.6)
binom.test(fair_coin, n, p = 0.5)
binom.test(biased_coin, n, p = 0.5)
```
```{block, type="info"}
Run the code above several times, noting the p-values for the fair and biased coins.
Alternatively, you can [simulate coin flips](https://debruine.shinyapps.io/coinsim/)
online and build up a graph of results and p-values.
* How does the p-value vary for the fair and biased coins?
* What happens to the confidence intervals if you increase n from 10 to 100?
* What criterion would you use to tell if the observed data indicate the coin is fair or biased?
* How often do you conclude the fair coin is biased (false positives)?
* How often do you conclude the biased coin is fair (false negatives)?
```
#### False positives & negatives
The probability that a test concludes the fair coin is biased is called the
*false positive rate* (or _Type I Error Rate_). The *alpha* is the false positive
rate we accept for a test. This is traditionally set at 0.05, but there are good
arguments for setting a different criterion in some circumstances.
The probability that a test concludes the biased coin is fair is called the
*false negative rate* (or _Type II Error Rate_). The *power* of a test is 1
minus its false negative rate (i.e., the *true positive rate*). Power depends
on how biased the coin is and how many samples we take.
#### Sampling function
To estimate these rates, we need to repeat the sampling above many times.
A function is ideal for repeating the exact same procedure over and over. Set
the arguments of the function to variables that you might want to change. Here,
we will want to estimate power for:
* different sample sizes (`n`)
* different coin biases (`bias`)
* different hypothesised probabilities (`p`, defaults to 0.5)
```{r sim_binom_test}
sim_binom_test <- function(n, bias, p = 0.5) {
coin <- rbinom(1, n, bias)
btest <- binom.test(coin, n, p)
btest$p.value
}
```
Once you've created your function, test it a few times, changing the values.
```{r sbt-2}
sim_binom_test(100, 0.6)
```
#### Calculate power
Then you can use the `replicate()` function to run it many times and save all
the output values. You can calculate the *power* of your analysis by checking the
proportion of your simulated analyses that have a p-value less than your _alpha_
(the probability of rejecting the null hypothesis when the null hypothesis is true).
```{r replicate}
my_reps <- replicate(1e4, sim_binom_test(100, 0.6))
mean(my_reps < 0.05)
```
```{block, type="info"}
`1e4` is just scientific notation for a 1 followed
by 4 zeros (`10000`). When you're running simulations, you usually want to run a
lot of them, and it's a pain to keep track of whether you've typed 5 or 6 zeros
(100000 vs 1000000), which would change your running time by an order of
magnitude.
```
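The same approach estimates the *false positive rate*: simulate a coin whose true bias equals the hypothesised probability, so the null hypothesis is true and every significant result is a false positive. This is a minimal sketch using the `sim_binom_test()` function from above.
```{r fpr-check}
fair_reps <- replicate(1e4, sim_binom_test(100, 0.5))
mean(fair_reps < 0.05) # should be close to (or a little below) alpha
```
The exact binomial test is conservative for discrete data, so this estimate can come out a little below 0.05.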
### Normal Distribution
#### Sample distribution
`rnorm(n, mean, sd)`
We can simulate a normally distributed sample of size `n` if we know the `mean` and
standard deviation (`sd`). A density plot is usually the best way to visualise
this type of data if your `n` is large.
```{r rnorm}
dv <- rnorm(1e5, 10, 2)
ggplot() +
geom_density(aes(dv), fill = "white") +
geom_vline(xintercept = mean(dv), color = "red") +
geom_vline(xintercept = quantile(dv, .5 - (.6827/2)), color = "darkgreen") +
geom_vline(xintercept = quantile(dv, .5 + (.6827/2)), color = "darkgreen") +
geom_vline(xintercept = quantile(dv, .5 - (.9545/2)), color = "blue") +
geom_vline(xintercept = quantile(dv, .5 + (.9545/2)), color = "blue") +
geom_vline(xintercept = quantile(dv, .5 - (.9973/2)), color = "purple") +
geom_vline(xintercept = quantile(dv, .5 + (.9973/2)), color = "purple") +
scale_x_continuous(
limits = c(0,20),
breaks = seq(0,20)
)
```
```{block, type="info"}
Run the simulation above several times, noting how
the density plot changes. What do the vertical lines represent? Try changing the
values of `n`, `mean`, and `sd`.
```
#### T-test
`t.test(x, y, alternative, mu, paired)`
Use a t-test to compare the mean of one distribution to a null hypothesis
(one-sample t-test), compare the means of two samples (independent-samples t-test),
or compare pairs of values (paired-samples t-test).
You can run a one-sample t-test comparing the mean of your data to `mu`. Here is
a simulated distribution with a mean of 0.5 and an SD of 1, creating an effect
size of 0.5 SD when tested against a `mu` of 0. Run the simulation a few times to
see how often the t-test returns a significant p-value (or run it in the [shiny app](http://shiny.psy.gla.ac.uk/debruine/normsim/)).
```{r sim_norm}
sim_norm <- rnorm(100, 0.5, 1)
t.test(sim_norm, mu = 0)
```
Run an independent-samples t-test by comparing two lists of values.
```{r t-test}
a <- rnorm(100, 0.5, 1)
b <- rnorm(100, 0.7, 1)
t_ind <- t.test(a, b, paired = FALSE)
t_ind
```
```{block, type="warning"}
The `paired` argument defaults to `FALSE`, but it's good practice to always explicitly set it so you are never confused about what type of test you are performing.
```
#### Sampling function
We can use the `names()` function to find out the names of all the components of
the `t.test` output, and use these to extract a single value, such as the test statistic (the t-value).
```{r names}
names(t_ind)
t_ind$statistic
```
Alternatively, use `broom::tidy()` to convert the output into a tidy table.
```{r broom}
broom::tidy(t_ind)
```
If you want to run the simulation many times and record information each time,
first you need to turn your simulation into a function.
```{r sim_t_ind}
sim_t_ind <- function(n, m1, sd1, m2, sd2) {
v1 <- rnorm(n, m1, sd1)
v2 <- rnorm(n, m2, sd2)
t_ind <- t.test(v1, v2, paired = FALSE)
return(t_ind$p.value)
}
```
Run it a few times to check that it gives you sensible values.
```{r run-sim_t_ind}
sim_t_ind(100, 0.7, 1, 0.5, 1)
```
Now replicate the simulation 10,000 times (`1e4`).
```{r reps}
my_reps <- replicate(1e4, sim_t_ind(100, 0.7, 1, 0.5, 1))
alpha <- 0.05
power <- mean(my_reps < alpha)
power
```
```{block, type="try"}
Run the code above several times. How much does the power value fluctuate? How many replications do you need to run to get a reliable estimate of power?
```
Compare your power estimate from simulation to a power calculation using `power.t.test()`.
Here, `delta` is the difference between `m1` and `m2` above.
```{r power.t.test}
power.t.test(n = 100, delta = 0.2, sd = 1, sig.level = alpha, type = "two.sample")
```
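You can also leave `n` out and specify the `power` you want; `power.t.test()` will then solve for the sample size per group. A sketch:
```{r power-n-for-80}
# per-group n needed for 80% power with this effect size
power.t.test(delta = 0.2, sd = 1, sig.level = alpha, power = 0.8, type = "two.sample")
```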
You can plot the distribution of p-values.
```{r plot-reps}
ggplot() +
geom_histogram(
aes(my_reps),
binwidth = 0.05,
boundary = 0,
fill = "white",
color = "black"
)
```
```{block, type="try"}
What do you think the distribution of p-values is
when there is no effect (i.e., the means are identical)? Check this yourself.
```
```{block, type="warning"}
Make sure the `boundary` argument is set to `0` for p-value histograms. See what happens with a null effect if `boundary` is not set.
```
#### Bivariate Normal
##### Correlation
You can quantify how strongly two continuous variables are related to each other using the `cor()` function.
Below is a quick and dirty way to generate two correlated variables. `x` is drawn
from a normal distribution, while `y` is the sum of `x` and another value drawn
from a random normal distribution. We'll learn later how to generate specific
correlations in simulated data.
```{r cor}
n <- 100 # number of random samples
x <- rnorm(n, 0, 1)
y <- x + rnorm(n, 0, 1)
cor(x, y)
```
`cor()` defaults to Pearson's correlations. Set the `method` argument to use
Kendall or Spearman correlations.
```{r cor-spearman}
cor(x, y, method = "spearman")
```
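With the quick-and-dirty construction above, the population correlation is not arbitrary: `y` is the sum of `x` and independent noise with the same SD, so its variance is twice that of `x` and the expected correlation is 1/sqrt(2), about 0.71. A sketch with a large sample makes this visible.
```{r cor-expected}
x_big <- rnorm(1e5, 0, 1)
y_big <- x_big + rnorm(1e5, 0, 1)
cor(x_big, y_big) # should be close to 0.707
```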
##### Sample distribution
<a name="bvn"></a>
What if we want to sample from a population with specific relationships between
variables? We can sample from a _bivariate normal distribution_ using `mvrnorm()` from the `MASS` package.
```{r bvn}
n <- 1000 # number of random samples
rho <- 0.5 # population correlation between the two variables
mu <- c(10, 20) # the means of the samples
stdevs <- c(5, 6) # the SDs of the samples
# correlation matrix
cor_mat <- matrix(c( 1, rho,
rho, 1), 2)
# create the covariance matrix
sigma <- (stdevs %*% t(stdevs)) * cor_mat
# sample from bivariate normal distribution
bvn <- mvrnorm(n, mu, sigma)
cor(bvn) # check correlation matrix
```
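It is also worth checking the means and SDs, not just the correlation (a quick sketch; the column order matches `mu` and `stdevs`).
```{r bvn-check}
colMeans(bvn)     # should be close to c(10, 20)
apply(bvn, 2, sd) # should be close to c(5, 6)
```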
Plot your sampled variables to check everything worked like you expect. You need
to convert the output of `mvrnorm()` into a tibble in order to use it in ggplot.
```{r graph-bvn}
bvn %>%
as_tibble() %>%
ggplot(aes(V1, V2)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
geom_density2d()
```
### Multivariate Normal
#### Sample distribution
```{r mvn}
n <- 200 # number of random samples
rho1_2 <- 0.5 # correlation between v1 and v2
rho1_3 <- 0 # correlation between v1 and v3
rho2_3 <- 0.7 # correlation between v2 and v3
mu <- c(10, 20, 30) # the means of the samples
stdevs <- c(8, 9, 10) # the SDs of the samples
# correlation matrix
cor_mat <- matrix(c( 1, rho1_2, rho1_3,
rho1_2, 1, rho2_3,
rho1_3, rho2_3, 1), 3)
sigma <- (stdevs %*% t(stdevs)) * cor_mat
bvn3 <- mvrnorm(n, mu, sigma)
cor(bvn3) # check correlation matrix
```
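Before moving to an interactive 3D plot, a base-R scatterplot matrix is a quick way to eyeball all the pairwise relationships at once (a sketch).
```{r mvn-pairs}
pairs(bvn3)
```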
#### 3D Plots
You can use the `plotly` library to make a 3D graph.
```{r 3d-graph-mvn}
library(plotly)
marker_style <- list(
color = "#ff0000",
line = list(
color = "#444",
width = 1
),
opacity = 0.5,
size = 5
)
bvn3 %>%
as_tibble() %>%
plot_ly(x = ~V1, y = ~V2, z = ~V3, marker = marker_style) %>%
add_markers()
```
## Example
This example uses the [Growth Chart Data Tables](https://www.cdc.gov/growthcharts/data/zscore/zstatage.csv)
from the [US CDC](https://www.cdc.gov/growthcharts/zscore.htm).
### Load & wrangle
We have to do a little data wrangling first. Have a look at the data after you
import it and relabel `Sex` to `male` and `female` instead of `1` and `2`. Also
convert `Agemos` (age in months) to years. Relabel the column `0` as `mean` and
calculate a new column named `sd` as the difference between columns `1` and `0`.
```{r load-height-data}
height_age <- read_csv("https://www.cdc.gov/growthcharts/data/zscore/zstatage.csv") %>%
filter(Sex %in% c(1,2)) %>%
mutate(
sex = recode(Sex, "1" = "male", "2" = "female"),
age = as.numeric(Agemos)/12,
sd = `1` - `0`
) %>%
dplyr::select(sex, age, mean = `0`, sd)
```
```{block, type="warning"}
If you run the code above without putting `dplyr::`
before the `select()` function, you will get an error message. This is because
the `MASS` package also has a function called `select()` and, since we loaded
`MASS` after `tidyverse`, the `MASS` function is the default. When you loaded
`MASS`, you should have seen a warning like "The following object is masked from
‘package:dplyr’: select". You can use functions with the same name from different
packages by specifying the package before the function name, separated by two colons.
```
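Have a quick look at the wrangled data (a sketch; `head()` works just as well).
```{r glimpse-height}
glimpse(height_age)
```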
### Plot
Plot your new data frame to see how mean height changes with age for boys and girls.
```{r plot-height-means}
ggplot(height_age, aes(age, mean, color = sex)) +
geom_smooth(aes(ymin = mean - sd, ymax = mean + sd), stat="identity")
```
### Get means and SDs
Create new variables for the means and SDs for 20-year-old men and women.
```{r height-subset-20}
height_sub <- height_age %>% filter(age == 20)
m_mean <- height_sub %>% filter(sex == "male") %>% pull(mean)
m_sd <- height_sub %>% filter(sex == "male") %>% pull(sd)
f_mean <- height_sub %>% filter(sex == "female") %>% pull(mean)
f_sd <- height_sub %>% filter(sex == "female") %>% pull(sd)
height_sub
```
### Simulate a population
Simulate 50 random male heights and 50 random female heights using the `rnorm()`
function and the means and SDs above. Plot the data.
```{r sim-height-20}
sim_height <- tibble(
male = rnorm(50, m_mean, m_sd),
female = rnorm(50, f_mean, f_sd)
) %>%
gather("sex", "height", male:female)
ggplot(sim_height) +
geom_density(aes(height, fill = sex), alpha = 0.5) +
xlim(125, 225)
```
```{block, type="try"}
Run the simulation above several times, noting how the density plot changes. Try changing the age you're simulating.
```
### Analyse simulated data
Use the `sim_t_ind(n, m1, sd1, m2, sd2)` function we created above to generate
one simulation with a sample size of 50 in each group using the means and SDs
of male and female 14-year-olds.
```{r subset-height-14}
height_sub <- height_age %>% filter(age == 14)
m_mean <- height_sub %>% filter(sex == "male") %>% pull(mean)
m_sd <- height_sub %>% filter(sex == "male") %>% pull(sd)
f_mean <- height_sub %>% filter(sex == "female") %>% pull(mean)
f_sd <- height_sub %>% filter(sex == "female") %>% pull(sd)
sim_t_ind(50, m_mean, m_sd, f_mean, f_sd)
```
### Replicate simulation
Now replicate this 1e4 times using the `replicate()` function. This function
will save the returned p-values in a vector (`my_reps`). We can then check what
proportion of those p-values are less than our alpha value. This is the power of
our test.
```{r rep-height-14}
my_reps <- replicate(1e4, sim_t_ind(50, m_mean, m_sd, f_mean, f_sd))
alpha <- 0.05
power <- mean(my_reps < alpha)
power
```
### One-tailed prediction
This design has about 65% power to detect the sex difference in height (with a
two-tailed test). Modify the `sim_t_ind` function for a one-tailed prediction.
You could just set `alternative` to "greater" inside the function, but it is better
to add an `alternative` argument to your function (giving it the same default value
as `t.test`) and pass its value through to the `alternative` argument of `t.test()`.
```{r add-1tailed}
sim_t_ind <- function(n, m1, sd1, m2, sd2, alternative = "two.sided") {
v1 <- rnorm(n, m1, sd1)
v2 <- rnorm(n, m2, sd2)
t_ind <- t.test(v1, v2, paired = FALSE, alternative = alternative)
return(t_ind$p.value)
}
my_reps <- replicate(1e4, sim_t_ind(50, m_mean, m_sd, f_mean, f_sd, "greater"))
mean(my_reps < alpha)
```
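As a rough cross-check on the simulation, you can run the analytic calculation with `power.t.test()` (a sketch, assuming a common SD equal to the average of the two group SDs, since `power.t.test()` only takes a single `sd`).
```{r power-onetailed-check}
power.t.test(n = 50, delta = m_mean - f_mean, sd = (m_sd + f_sd) / 2,
             sig.level = alpha, type = "two.sample", alternative = "one.sided")
```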
### Range of sample sizes
What if we want to find out what sample size will give us 80% power? We could use
trial and error: we know the number should be slightly larger than 50. But you
can search more systematically by repeating your power calculation for a range
of sample sizes.
```{block, type="info"}
This might seem like overkill for a t-test, where you can easily look up sample size calculators online, but it is a valuable skill to learn for when your analyses become more complicated.
```
Start with a relatively low number of replications and/or more spread-out samples
to estimate where you should be looking more specifically. Then you can repeat
with a narrower/denser range of sample sizes and more iterations.
```{r range-sample-sizes}
alpha <- 0.05
power_table <- tibble(
n = seq(20, 100, by = 5)
) %>%
mutate(power = map_dbl(n, function(n) {
ps <- replicate(1e3, sim_t_ind(n, m_mean, m_sd, f_mean, f_sd, "greater"))
mean(ps < alpha)
}))
ggplot(power_table, aes(n, power)) +
geom_smooth() +
geom_point() +
geom_hline(yintercept = 0.8)
```
Now we can narrow down our search to values around 55 (plus or minus 5) and
increase the number of replications from 1e3 to 1e4.
```{r narrow-range-sample-sizes}
power_table <- tibble(
n = seq(50, 60)
) %>%
mutate(power = map_dbl(n, function(n) {
ps <- replicate(1e4, sim_t_ind(n, m_mean, m_sd, f_mean, f_sd, "greater"))
mean(ps < alpha)
}))
ggplot(power_table, aes(n, power)) +
geom_smooth() +
geom_point() +
geom_hline(yintercept = 0.8) +
scale_x_continuous(breaks = power_table$n)
```
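From the narrower table you can pull out the smallest sample size that reaches your target power (a sketch; the exact value will vary a little from run to run).
```{r min-n-80}
power_table %>%
  filter(power >= 0.8) %>%
  slice(1)
```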
## Formative exercises
Download the [formative exercises](formative_exercises/08_simulations_stub.Rmd). See the [answers](formative_exercises/08_simulations_answers.Rmd) only after you've attempted all the questions.