---
title: "tibble"
author: "Kevin O'Brien"
date: "22 September 2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tibble)
```
## Tibbles
Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (e.g. converting character vectors to factors).
#### Creating
``tibble()`` is a nice way to create data frames. It encapsulates best practices for data frames:
It never changes an input's type (i.e., no more stringsAsFactors = FALSE!).
```{r}
tibble(x = letters)
#> # A tibble: 26 x 1
#>    x
#>    <chr>
#>  1 a
#>  2 b
#>  3 c
#>  4 d
#>  5 e
#>  6 f
#>  7 g
#>  8 h
#>  9 i
#> 10 j
#> # ... with 16 more rows
```
This makes it easier to use with list-columns:
```{r}
tibble(x = 1:3, y = list(1:5, 1:10, 1:20))
#> # A tibble: 3 x 2
#>       x y
#>   <int> <list>
#> 1     1 <int [5]>
#> 2     2 <int [10]>
#> 3     3 <int [20]>
```
List-columns are most commonly created by do(), but they can be useful to create by hand.
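As a minimal sketch of working with a hand-built list-column, the chunk below pulls one cell out with `[[` and counts the elements in each cell with base R's `lengths()`. The object name `df` is ours, chosen for illustration.

```{r}
library(tibble)

# Build a tibble whose second column is a list of integer vectors.
df <- tibble(x = 1:3, y = list(1:5, 1:10, 1:20))

# Each cell of a list-column holds a whole vector; `[[` extracts one of them.
df$y[[2]]

# lengths() reports how many elements each cell contains.
lengths(df$y)
```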
It never adjusts the names of variables:
```{r}
names(data.frame(`crazy name` = 1))
#> [1] "crazy.name"
names(tibble(`crazy name` = 1))
#> [1] "crazy name"
```
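Because the name is kept as-is, a non-syntactic column has to be referred to with backticks or by its name as a string. A small sketch (the object name `odd` is illustrative):

```{r}
library(tibble)

# The column name contains a space, so it is not a syntactic R name.
odd <- tibble(`crazy name` = 1)

# Backticks let `$` reach the column ...
odd$`crazy name`

# ... and `[[` accepts the name as an ordinary string.
odd[["crazy name"]]
```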
It evaluates its arguments lazily and sequentially:
```{r}
tibble(x = 1:5, y = x ^ 2)
#> # A tibble: 5 x 2
#>       x     y
#>   <int> <dbl>
#> 1     1  1.00
#> 2     2  4.00
#> 3     3  9.00
#> 4     4 16.0
#> 5     5 25.0
```
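For contrast, here is a hedged sketch of what sequential evaluation buys you: `data.frame()` evaluates its arguments in the calling environment, so it cannot see a column defined earlier in the same call. The names `side`, `area`, `squares` and `base_result` are illustrative, and the sketch assumes no object called `side` already exists in your session.

```{r}
library(tibble)

# tibble() can refer to `side` because its arguments are evaluated in order.
squares <- tibble(side = 1:3, area = side^2)
squares$area

# data.frame() looks for `side` outside the call and fails when it is not found;
# the error is captured here so the chunk keeps running.
base_result <- tryCatch(
  data.frame(side = 1:3, area = side^2),
  error = function(e) "error: 'side' not found"
)
base_result
```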
It never uses row.names(). The whole point of tidy data is to store variables in a consistent way, so it never stores a variable as a special attribute.
It only recycles vectors of length 1. This is because recycling vectors of greater lengths is a frequent source of bugs.
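A short sketch of the recycling rule, with the error captured by `tryCatch()` so the chunk keeps running. The object names `ok` and `recycled` are illustrative.

```{r}
library(tibble)

# A length-1 vector is recycled to match the other column.
ok <- tibble(x = 1:4, y = 0)
nrow(ok)

# A length-2 vector is not recycled, even though 4 is a multiple of 2;
# tibble() signals an error, which is captured here instead of stopping.
recycled <- tryCatch(
  tibble(x = 1:4, y = 1:2),
  error = function(e) "error"
)
recycled
```

Base `data.frame(x = 1:4, y = 1:2)` would silently repeat `y`, which is the kind of behaviour this rule is meant to avoid.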
#### Coercion
To complement tibble(), tibble provides as_tibble() to coerce objects into tibbles. Generally, as_tibble() methods are much simpler than as.data.frame() methods. It is not precisely what as.data.frame() does, but the list method is similar to do.call(cbind, lapply(x, data.frame)), i.e. it coerces each component to a data frame and then cbind()s them all together.
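As a minimal illustration of coercion, the sketch below converts a named list and an ordinary data frame. The object and column names (`from_list`, `from_df`, `id`, `label`, `score`) are made up for the example.

```{r}
library(tibble)

# A named list of equal-length vectors becomes one column per element.
from_list <- as_tibble(list(id = 1:3, label = c("a", "b", "c")))
from_list

# An existing data frame is converted column by column.
from_df <- as_tibble(data.frame(id = 1:3, score = c(0.5, 1.5, 2.5)))
is_tibble(from_df)
```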
``as_tibble()`` has been written with an eye for performance:
```{r}
# Run the benchmark only if microbenchmark is installed.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  # A named list of 26 integer vectors, one per letter.
  l <- replicate(26, sample(100), simplify = FALSE)
  names(l) <- letters

  microbenchmark::microbenchmark(
    as_tibble(l),
    as.data.frame(l)
  )
}
```
The speed of as.data.frame() is not usually a bottleneck when used interactively, but can be a problem when combining thousands of messy inputs into one tidy data frame.
#### Tibbles vs data frames
There are three key differences between tibbles and data frames: printing, subsetting, and recycling rules.
#### Printing
When you print a tibble, it only shows the first ten rows and all the columns that fit on one screen. It also prints an abbreviated description of the column type:
```{r}
tibble(x = 1:1000)
#> # A tibble: 1,000 x 1
#>        x
#>    <int>
#>  1     1
#>  2     2
#>  3     3
#>  4     4
#>  5     5
#>  6     6
#>  7     7
#>  8     8
#>  9     9
#> 10    10
#> # ... with 990 more rows
```
You can control the default appearance with options:
options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only the first m rows. Use options(tibble.print_max = Inf) to always show all rows.
options(tibble.width = Inf) will always print all columns, regardless of the width of the screen.
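A hedged sketch of how these options might be set temporarily and then restored. The values 5 and 3 are arbitrary, and the object name `old` is ours.

```{r}
library(tibble)

# options() returns the previous values, so they can be restored afterwards.
old <- options(tibble.print_max = 5, tibble.print_min = 3)

# Six rows exceed print_max = 5, so only the first print_min = 3 rows
# should be shown.
tibble(x = 1:6)

# Restore whatever was set before.
options(old)
```

Saving and restoring the old values keeps the change from leaking into later chunks.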