forked from PsyTeachR/msc-data-skills
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05-dplyr.Rmd
699 lines (522 loc) · 20.2 KB
/
05-dplyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
# Data Wrangling {#dplyr}
## Learning Objectives
### Basic
1. Be able to use the 6 main dplyr one-table verbs:
+ [`select()`](#select)
+ [`filter()`](#filter)
+ [`arrange()`](#arrange)
+ [`mutate()`](#mutate)
+ [`summarise()`](#summarise)
+ [`group_by()`](#group_by)
### Intermediate
2. Also know these additional one-table verbs:
+ [`rename()`](#rename)
+ [`distinct()`](#distinct)
+ [`count()`](#count)
+ [`slice()`](#slice)
+ [`pull()`](#pull)
### Advanced
3. Fine control of [`select()` operations](#select_helpers)
4. Perform 'windowed' operations
+ windowed `mutate()`
+ windowed `slice()`
## Resources
* [Chapter 5: Data Transformation](http://r4ds.had.co.nz/transform.html) in *R for Data Science*
* [Data transformation cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf)
* [Lecture slides on dplyr one-table verbs](slides/04_dplyr_slides.pdf)
* [Chapter 16: Date and times](http://r4ds.had.co.nz/dates-and-times.html) in *R for Data Science*
## Setup
You'll need the following packages.
```{r, results = 'hide', warning = FALSE, message = FALSE}
# libraries needed for these examples
library(lubridate)
library(tidyverse)
library(ukbabynames)
```
## The `disgust` dataset
These examples will use data from [disgust.csv](/data/disgust.csv), which contains data from the [Three Domain Disgust Scale](http://digitalrepository.unm.edu/cgi/viewcontent.cgi?article=1139&context=psy_etds). Each participant is identified by a unique `user_id` and each questionnaire completion has a unique `id`.
```{r, message = FALSE}
disgust <- read_csv("https://gupsych.github.io/data_skills/data/disgust.csv")
```
*Questionnaire Instructions*: The following items describe a variety of concepts. Please rate how disgusting you find the concepts described in the items, where 0 means that you do not find the concept disgusting at all, and 6 means that you find the concept extremely disgusting.
| colname | question |
|----------:|:----------------------------------------------------------------------------------|
| moral1 | Shoplifting a candy bar from a convenience store |
| moral2 | Stealing from a neighbor |
| moral3 | A student cheating to get good grades |
| moral4 | Deceiving a friend |
| moral5 | Forging someone's signature on a legal document |
| moral6 | Cutting to the front of a line to purchase the last few tickets to a show |
| moral7 | Intentionally lying during a business transaction |
| sexual1 | Hearing two strangers having sex |
| sexual2 | Performing oral sex |
| sexual3 | Watching a pornographic video |
| sexual4 | Finding out that someone you don't like has sexual fantasies about you |
| sexual5 | Bringing someone you just met back to your room to have sex |
| sexual6 | A stranger of the opposite sex intentionally rubbing your thigh in an elevator |
| sexual7 | Having anal sex with someone of the opposite sex |
| pathogen1 | Stepping on dog poop |
| pathogen2 | Sitting next to someone who has red sores on their arm |
| pathogen3 | Shaking hands with a stranger who has sweaty palms |
| pathogen4 | Seeing some mold on old leftovers in your refrigerator |
| pathogen5 | Standing close to a person who has body odor |
| pathogen6 | Seeing a cockroach run across the floor |
| pathogen7 | Accidentally touching a person's bloody cut |
## Six main dplyr verbs
Most of the data wrangling you'll want to do with psychological data will involve the `tidyr` verbs you learned in [Chapter 3](#tidyr) and the six main `dplyr` verbs: `select`, `filter`, `arrange`, `mutate`, `summarise`, and `group_by`.
### select() {#select}
Select columns by name or number.
You can select each column individually, separated by commas (e.g., `col1, col2`). You can also select all columns between two columns by separating them with a colon (e.g., `start_col:end_col`).
```{r}
moral <- disgust %>% select(user_id, moral1:moral7)
names(moral)
```
You can select columns by number, which is useful when the column names are long or complicated.
```{r}
sexual <- disgust %>% select(2, 11:17)
names(sexual)
```
You can use a minus symbol to unselect columns, leaving all of the other columns. If you want to exclude a span of columns, put parentheses around the span first (e.g., `-(moral1:moral7)`, not `-moral1:moral7`).
```{r}
pathogen <- disgust %>% select(-id, -date, -(moral1:sexual7))
names(pathogen)
```
You can select columns based on criteria about the column names.{#select_helpers}
#### `starts_with()` {#starts_with}
Select columns that start with a character string.
```{r}
u <- disgust %>% select(starts_with("u"))
names(u)
```
#### `ends_with()` {#ends_with}
Select columns that end with a character string.
```{r}
firstq <- disgust %>% select(ends_with("1"))
names(firstq)
```
#### `contains()` {#contains}
Select columns that contain a character string.
```{r}
pathogen <- disgust %>% select(contains("pathogen"))
names(pathogen)
```
#### `num_range()` {#num_range}
Select columns with a name that matches the pattern `prefix`.
```{r}
moral2_4 <- disgust %>% select(num_range("moral", 2:4))
names(moral2_4)
```
```{block, type="info"}
Use `width` to set the number of digits with leading
zeros. For example, `num_range('var_', 8:10, width=2)` selects columns `var_08`, `var_09`, and `var_10`.
```
### filter() {#filter}
Select rows by matching column criteria.
Select all rows where the user_id is 1 (that's Lisa).
```{r}
disgust %>% filter(user_id == 1)
```
```{block, type="warning"}
Remember to use `==` and not `=` to check if two things are equivalent. A single `=` assigns the righthand value to the lefthand variable and (usually) evaluates to `TRUE`.
```
You can select on multiple criteria by separating them with commas.
```{r}
amoral <- disgust %>% filter(
moral1 == 0,
moral2 == 0,
moral3 == 0,
moral4 == 0,
moral5 == 0,
moral6 == 0,
moral7 == 0
)
```
You can use the symbols `&`, `|`, and `!` to mean "and", "or", and "not". You can also use other operators to make equations.
```{r}
# everyone who chose either 0 or 7 for question moral1
moral_extremes <- disgust %>%
filter(moral1 == 0 | moral1 == 7)
# everyone who chose the same answer for all moral questions
moral_consistent <- disgust %>%
filter(
moral2 == moral1 &
moral3 == moral1 &
moral4 == moral1 &
moral5 == moral1 &
moral6 == moral1 &
moral7 == moral1
)
# everyone who did not answer 7 for all 7 moral questions
moral_no_ceiling <- disgust %>%
filter(moral1+moral2+moral3+moral4+moral5+moral6+moral7 != 7*7)
```
Sometimes you need to exclude some participant IDs for reasons that can't be described in code. the `%in%` operator is useful here for testing if a column value is in a list. Surround the equation with parentheses and put `!` in front to test that a value is not in the list.
```{r}
no_researchers <- disgust %>%
filter(!(user_id %in% c(1,2)))
```
#### Dates {#dates}
You can use the `lubridate` package to work with dates. For example, you can use the `year()` function to return just the year from the `date` column and then select only data collected in 2010.
```{r}
disgust2010 <- disgust %>%
filter(year(date) == 2010)
```
Or select data from at least 5 years ago. You can use the `range` function to check the minimum and maxiumum dates in the resulting dataset.
```{r}
disgust_5ago <- disgust %>%
filter(date < today() - dyears(5))
range(disgust_5ago$date)
```
### arrange() {#arrange}
Sort your dataset using `arrange()`.
```{r}
disgust_order <- disgust %>%
arrange(id)
head(disgust_order)
```
Reverse the order using `desc()`
```{r}
disgust_order <- disgust %>%
arrange(desc(id))
head(disgust_order)
```
### mutate() {#mutate}
Add new columns. This is one of the most useful functions in the tidyverse.
Refer to other columns by their names (unquoted). You can add more than one column, just separate the columns with a comma. Once you make a new column, you can use it in further column definitions e.g., `total` below).
```{r}
disgust_total <- disgust %>%
mutate(
pathogen = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7,
moral = moral1 + moral2 + moral3 + moral4 + moral5 + moral6 + moral7,
sexual = sexual1 + sexual2 + sexual3 + sexual4 + sexual5 + sexual6 + sexual7,
total = pathogen + moral + sexual,
user_id = paste0("U", user_id)
)
```
```{block, type="warning"}
You can overwrite a column by giving a new column
the same name as the old column. Make sure that you mean to do this and that you aren't trying to use the old column value after you redefine it.
```
### summarise() {#summarise}
Create summary statistics for the dataset. Check the [Data Wrangling Cheat Sheet](https://www.rstudio.org/links/data_wrangling_cheat_sheet) or the [Data Transformation Cheat Sheet](https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf) for various summary functions. Some common ones are: `mean()`, `sd()`, `n()`, `sum()`, and `quantile()`.
```{r}
disgust_total %>%
summarise(
n = n(),
q25 = quantile(total, .25, na.rm = TRUE),
q50 = quantile(total, .50, na.rm = TRUE),
q75 = quantile(total, .75, na.rm = TRUE),
avg_total = mean(total, na.rm = TRUE),
sd_total = sd(total, na.rm = TRUE),
min_total = min(total, na.rm = TRUE),
max_total = max(total, na.rm = TRUE)
)
```
### group_by() {#group_by}
Create subsets of the data. You can use this to create summaries,
like the mean value for all of your experimental groups.
Here, we'll use `mutate` to create a new column called `year`, group by `year`, and calculate the average scores.
```{r}
disgust_total %>%
mutate(year = year(date)) %>%
group_by(year) %>%
summarise(
n = n(),
avg_total = mean(total, na.rm = TRUE),
sd_total = sd(total, na.rm = TRUE),
min_total = min(total, na.rm = TRUE),
max_total = max(total, na.rm = TRUE)
)
```
You can use `filter` after `group_by`. The following example returns the lowest total score from each year.
```{r}
disgust_total %>%
mutate(year = year(date)) %>%
select(user_id, year, total) %>%
group_by(year) %>%
filter(rank(total) == 1) %>%
arrange(year)
```
You can also use `mutate` after `group_by`. The following example calculates subject-mean-centered scores by grouping the scores by `user_id` and then subtracting the group-specific mean from each score. <span class="text-warning">Note the use of `gather` to tidy the data into a long format first.</span>
```{r}
disgust_smc <- disgust %>%
gather("question", "score", moral1:pathogen7) %>%
group_by(user_id) %>%
mutate(score_smc = score - mean(score, na.rm = TRUE))
```
### All Together
A lot of what we did above would be easier if the data were tidy, so let's do that first. Then we can use `group_by` to calculate the domain scores.
```{block, type="warning"}
It is good practice to use `ungroup()` after using `group_by` and `summarise`. Forgetting to ungroup the dataset won't affect some further processing, but can really mess up other things.
```
Then we can spread out the 3 domains, calculate the total score, remove any rows with a missing (`NA`) total, and calculate mean values by year.
```{r}
disgust_tidy <- read_csv("data/disgust.csv") %>%
gather("question", "score", moral1:pathogen7) %>%
separate(question, c("domain","q_num"), sep = -1) %>%
group_by(id, user_id, date, domain) %>%
summarise(score = mean(score)) %>%
ungroup()
disgust_tidy2 <- disgust_tidy %>%
spread(domain, score) %>%
mutate(
total = moral + sexual + pathogen,
year = year(date)
) %>%
filter(!is.na(total)) %>%
arrange(user_id)
disgust_tidy3 <- disgust_tidy2 %>%
group_by(year) %>%
summarise(
n = n(),
avg_pathogen = mean(pathogen),
avg_moral = mean(moral),
avg_sexual = mean(sexual),
first_user = first(user_id),
last_user = last(user_id)
)
disgust_tidy3
```
## Additional dplyr one-table verbs
### rename() {#rename}
### distinct() {#distinct}
### count() {#count}
### slice() {#slide}
### pull() {#pull}
## Exercises
Download the [formative exercises](formative_exercises/05_wrangle2_stub.Rmd). See the [answers](formative_exercises/05_wrangle2_answers.Rmd) only after you've attempted all the questions.
The following exercises use a dataset included in the `ukbabynames` package. Might as well convert to a tibble to make printing prettier.
```{r ukbn}
## convert to a tibble
ukb <- as_tibble(ukbabynames)
```
### Easy
1. How many records are in the dataset?
`r webex::hide()`
```{r}
count(ukb) ## or: nrow(ukb)
```
`r webex::unhide()`
2. Remove the column `rank` from the dataset.
`r webex::hide()`
```{r}
ukb %>%
select(-rank)
```
`r webex::unhide()`
3. What is the range of birth years contained in the dataset?
`r webex::hide()`
```{r}
ukb %>%
summarise(minyear = min(year),
maxyear = max(year))
```
`r webex::unhide()`
4. Pull out the babies named Hermione.
`r webex::hide()`
```{r}
ukb %>%
filter(name == "Hermione")
```
`r webex::unhide()`
5. Sort the dataset by sex and then by year (descending) and then by rank (descending).
`r webex::hide()`
```{r}
ukb %>%
arrange(sex, desc(year), desc(rank))
```
`r webex::unhide()`
6. Create a new column, `decade`, that contains the decade of birth (1990, 2000, 2010). Hint: see `?floor`
`r webex::hide()`
```{r}
ukb %>%
mutate(decade = floor(year / 10) * 10)
```
`r webex::unhide()`
7. Pull out the male babies named Courtney that were born between 1998 and 2001 (inclusive).
`r webex::hide()`
```{r}
ukb %>%
filter(name == "Courtney", sex == "M",
year >= 1998, year <= 2001)
```
`r webex::unhide()`
8. How many distinct names are represented in the dataset?
`r webex::hide()`
```{r}
## how many distinct names are represented in the dataset?
ukb %>%
distinct(name) %>%
count()
```
`r webex::unhide()`
9. Pull out all the female babies named Frankie that were born before 1999 or after 2010
`r webex::hide()`
```{r ex1}
ukb %>%
filter(name == "Frankie", sex == "F",
(year < 1999) | (year > 2010))
```
`r webex::unhide()`
10. How many total babies in the dataset were named 'Emily'?
`r webex::hide()`
```{r ex2}
##
ukb %>%
filter(name == "Emily") %>%
summarise(tot = sum(n)) %>%
pull(tot)
```
`r webex::unhide()`
11. How many distinct names are there for each sex?
`r webex::hide()`
```{r ex3}
ukb %>%
group_by(sex) %>%
distinct(name) %>%
count()
```
`r webex::unhide()`
12. What is the most popular name in the dataset?
`r webex::hide()`
```{r ex4}
ukb %>%
group_by(name) %>%
summarise(tot = sum(n)) %>%
arrange(desc(tot)) %>%
slice(1) %>%
pull(name)
```
`r webex::unhide()`
13. How many babies were born each year for each sex? Make a plot.
`r webex::hide()`
```{r ex5}
babes_per_year <- ukb %>%
group_by(year, sex) %>%
summarise(tot = sum(n))
ggplot(babes_per_year, aes(year, tot, color = sex)) +
geom_line()
```
`r webex::unhide()`
### Intermediate
- For the next questions, you might first want to read about window functions (`group_by()` + `filter()` or `mutate()` in the dplyr vignette: `vignette("window-functions")`.
1. Create a column `prop` that contains the proportion of babies that were given a particular name for a given sex in a given year, then `ungroup()` the resulting table.
`r webex::hide()`
```{r ex6}
ukb_prop <- ukb %>%
group_by(year, sex) %>%
mutate(p = n / sum(n)) %>%
ungroup()
## TODO double check that you did it right by making sure the props
## for each year/sex combo sum to 1
```
`r webex::unhide()`
2. Use a window function to pull out the top year for each name/sex combination in the table you just created (i.e., the year when the name was given to greatest proportion of babies of a given sex).
`r webex::hide()`
```{r ex7}
ukb_top_year <- ukb_prop %>%
group_by(name, sex) %>%
filter(p == max(p)) %>%
ungroup() %>%
arrange(year)
```
`r webex::unhide()`
### Advanced
1. Make a frequency histogram for the final letter of each name, broken down by sex. Are certain final letters more "gendered" than others?
`r webex::hide()`
```{r ex1-adv}
last_letter <- ukb %>%
mutate(lastchar = substr(name, nchar(name), nchar(name))) %>%
filter(lastchar %in% letters) %>%
count(sex, lastchar) %>%
arrange(lastchar)
ggplot(last_letter, aes(lastchar, n, fill = sex)) +
geom_bar(stat = "identity")
```
`r webex::unhide()`
2. Calculate the top 5 boys and girls names for each decade and print out the whole table.
`r webex::hide()`
```{r ex2-adv}
ukb %>%
mutate(decade = floor(year / 10) * 10) %>%
group_by(decade, sex, name) %>%
summarise(tot_n = sum(n)) %>%
arrange(desc(tot_n)) %>%
slice(1:5) %>%
ungroup() %>%
print(n = +Inf)
```
`r webex::unhide()`
3. Pull out the top 5 boys and girls names for the final year of the dataset, and plot the historical trajectory of their popularity, with separate graphs for boys and girls. (Hint: This might require merging tables, which you will learn about next week.)
`r webex::hide()`
```{r ex3-adv}
top_names <- ukb %>%
filter(year == max(year), rank <= 5) %>%
select(sex, name)
ukb %>%
inner_join(top_names, c("sex", "name")) %>%
ggplot(aes(year, n, color = name)) +
geom_line() +
facet_wrap(~sex)
```
`r webex::unhide()`
4. What are the 10 most androgynous names in the UK? Discard any names that were given to less than 5000 babies total. Calculate an "androgyny index" for each name as log(F + .5) - log(M + .5) where F is the number of female and M is the number of male babies. This index will be zero for perfect gender balance, positive for skewed female, and negative for skewed male.
`r webex::hide()`
```{r ex4-adv}
andro <- ukb %>%
group_by(sex, name) %>%
summarise(tot = sum(n)) %>%
ungroup() %>%
spread(sex, tot, fill = 0) %>%
mutate(N = F + M)
andro_gbal <- andro %>%
filter(N >= 5000) %>%
mutate(gbal = log(F + .5) - log(M + .5),
tot = F + M) %>%
arrange(abs(gbal)) %>%
slice(1:10)
```
`r webex::unhide()`
5. Which girl name has increased the most in popularity, if you directly compare the first year of the dataset to the last year? Which girl name has decreased the most? (Only comare names that were given to at least 500 babies in at least one of the years covered by the dataset.)
`r webex::hide()`
```{r ex5-adv}
name_pop <- ukb %>%
filter(sex == "F", (year == 1996) | (year == 2015)) %>%
spread(year, n, fill = 0) %>%
filter(`2015` > 500 | `1996` > 500) %>%
mutate(chg = `2015` - `1996`)
name_pop %>%
arrange(desc(chg)) %>%
slice(1) %>%
pull(name)
name_pop %>%
arrange(chg) %>%
slice(1) %>%
pull(name)
```
`r webex::unhide()`
6. Calculate the proportion of names that are androgynous for each year in the dataset (were given to both male and female babies) and then plot the historical trend.
`r webex::hide()`
```{r ex6-adv}
p_andro <- ukb %>%
select(-rank) %>%
spread(sex, n, fill = 0) %>%
mutate(is_andro = (F != 0) & (M != 0)) %>%
group_by(year) %>%
summarise(p = mean(is_andro))
ggplot(p_andro, aes(year, p)) + geom_line()
```
`r webex::unhide()`
7. *Naming diversity trends.* Calculate a naming diversity index (number of names divided by number of babies) for each year and sex in the dataset. Plot the historical trend for naming diversity.
`r webex::hide()`
```{r ex7-adv}
ndiversity <- ukb %>%
group_by(year, sex) %>%
summarise(n_names = n_distinct(name),
n_babies = sum(n),
d_index = n_names / n_babies) %>%
ungroup()
ggplot(ndiversity,
aes(year, d_index, color = sex)) + geom_line()
```
`r webex::unhide()`