forked from ScPoEcon/ScPoEconometrics
-
Notifications
You must be signed in to change notification settings - Fork 0
/
03-SummaryStats.Rmd
318 lines (225 loc) · 13.3 KB
/
03-SummaryStats.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
# Working With Data {#sum}
In this chapter we will first learn some basic concepts that help summarizing data. Then, we will tackle a real-world task and read, clean, and summarize data from the web.
## Summary Statistics
`R` has built in functions for a large number of summary statistics. For numeric variables, we can summarize data by looking at their center and spread, for example. *Make sure to have loaded the `ggplot2` library to be able to access the `mpg` dataset as introduced in section \@ref(dataframes).*
```{r}
library(ggplot2)
```
### Central Tendency {-}
Suppose we want to know the *mean* and *median* of all the values stored in the `data.frame` column `mpg$cty`:
| Measure | `R` | Result |
|---------|-------------------|---------------------|
| Mean | `mean(mpg$cty)` | `r mean(mpg$cty)` |
| Median | `median(mpg$cty)` | `r median(mpg$cty)` |
### Spread {-}
How do the values in that column *vary*? How far *spread out* are they?
| Measure | `R` | Result |
|--------------------|------------------|--------------------|
| Variance | `var(mpg$cty)` | `r var(mpg$cty)` |
| Standard Deviation | `sd(mpg$cty)` | `r sd(mpg$cty)` |
| IQR | `IQR(mpg$cty)` | `r IQR(mpg$cty)` |
| Minimum | `min(mpg$cty)` | `r min(mpg$cty)` |
| Maximum | `max(mpg$cty)` | `r max(mpg$cty)` |
| Range | `range(mpg$cty)` | `r range(mpg$cty)` |
### Categorical {-}
For categorical variables, counts and percentages can be used for summary.
```{r}
table(mpg$drv)
table(mpg$drv) / nrow(mpg)
```
## Plotting
Now that we have some data to work with, and we have learned about the data at the most basic level, our next tasks will be to visualize it. Often, a proper visualization can illuminate features of the data that can inform further analysis.
We will look at four methods of visualizing data by using the basic `plot` facilities built-in with `R`:
- Histograms
- Barplots
- Boxplots
- Scatterplots
### Histograms
When visualizing a single numerical variable, a **histogram** is useful. It summarizes the *distribution* of values in a vector. In `R` you create one using the `hist()` function:
```{r}
hist(mpg$cty)
```
The histogram function has a number of parameters which can be changed to make our plot look much nicer. Use the `?` operator to read the documentation for the `hist()` to see a full list of these parameters.
```{r}
hist(mpg$cty,
xlab = "Miles Per Gallon (City)",
main = "Histogram of MPG (City)", # main title
breaks = 12, # how many breaks?
col = "red",
border = "blue")
```
Importantly, you should always be sure to label your axes and give the plot a title. The argument `breaks` is specific to `hist()`. Entering an integer will give a suggestion to `R` for how many bars to use for the histogram. By default `R` will attempt to intelligently guess a good number of `breaks`, but as we can see here, it is sometimes useful to modify this yourself.
### Barplots
Somewhat similar to a histogram, a barplot can provide a visual summary of a categorical variable, or a numeric variable with a finite number of values, like a ranking from 1 to 10.
```{r}
barplot(table(mpg$drv))
```
```{r}
barplot(table(mpg$drv),
xlab = "Drivetrain (f = FWD, r = RWD, 4 = 4WD)",
ylab = "Frequency",
main = "Drivetrains",
col = "dodgerblue",
border = "darkorange")
```
### Boxplots
To visualize the relationship between a numerical and categorical variable, once could use a **boxplot**. In the `mpg` dataset, the `drv` variable takes a small, finite number of values. A car can only be front wheel drive, 4 wheel drive, or rear wheel drive.
```{r}
unique(mpg$drv)
```
First note that we can use a single boxplot as an alternative to a histogram for visualizing a single numerical variable. To do so in `R`, we use the `boxplot()` function. The box shows the *interquartile range*, the solid line in the middle is the value of the median, the wiskers show 1.5 times the interquartile range, and the dots are outliers.
```{r}
boxplot(mpg$hwy)
```
However, more often we will use boxplots to compare a numerical variable for different values of a categorical variable.
```{r}
boxplot(hwy ~ drv, data = mpg)
```
Here used the `boxplot()` command to create side-by-side boxplots. However, since we are now dealing with two variables, the syntax has changed. The `R` syntax `hwy ~ drv, data = mpg` reads "Plot the `hwy` variable against the `drv` variable using the dataset `mpg`." We see the use of a `~` (which specifies a formula) and also a `data = ` argument. This will be a syntax that is common to many functions we will use in this course.
```{r}
boxplot(hwy ~ drv, data = mpg,
xlab = "Drivetrain (f = FWD, r = RWD, 4 = 4WD)",
ylab = "Miles Per Gallon (Highway)",
main = "MPG (Highway) vs Drivetrain",
pch = 20,
cex = 2,
col = "darkorange",
border = "dodgerblue")
```
Again, `boxplot()` has a number of additional arguments which have the ability to make our plot more visually appealing.
### Scatterplots
Lastly, to visualize the relationship between two numeric variables we will use a **scatterplot**. This can be done with the `plot()` function and the `~` syntax we just used with a boxplot. (The function `plot()` can also be used more generally; see the documentation for details.)
```{r}
plot(hwy ~ displ, data = mpg)
```
```{r}
plot(hwy ~ displ, data = mpg,
xlab = "Engine Displacement (in Liters)",
ylab = "Miles Per Gallon (Highway)",
main = "MPG (Highway) vs Engine Displacement",
pch = 20,
cex = 2,
col = "dodgerblue")
```
### `ggplot` {#ggplot}
All of the above plots could also have been generated using the `ggplot` function from the already loaded `ggplot2` package. Which function you use is up to you, but sometimes a plot is easier to build in base R (like in the `boxplot` example maybe), sometimes the other way around.
```{r}
ggplot(data = mpg,mapping = aes(x=displ,y=hwy)) + geom_point()
```
`ggplot` is impossible to describe in brief terms, so please look at [the package's website](http://ggplot2.tidyverse.org) which provides excellent guidance. We will from time to time use ggplot in this book, so try to familiarize yourself with it. Let's quickly demonstrate how one could customize that first plot:
```{r}
ggplot(data = mpg, mapping = aes(x=displ,y=hwy)) + # ggplot() makes base plot
geom_point(color="blue",size=2) + # how to show x and y?
scale_y_continuous(name="Miles Per Gallon (Highway)") + # name of y axis
scale_x_continuous(name="Engine Displacement (in Liters)") + # x axis
theme_bw() + # change the background
ggtitle("MPG (Highway) vs Engine Displacement") # add a title
```
## The `tidyverse`
[Hadley Wickham](http://hadley.nz) is the author of R packages `ggplot2` and also of `dplyr` (and also a myriad of others). With `ggplot2` he pioneered what he calls the *grammar of graphics* (hence, `gg`). Grammar in the sense that there are **nouns** and **verbs** and a **syntax**, i.e. rules of how nouns and verbs are to be put together to construct an understandable sentence. He has extended the *grammar* idea into various other packages. The `tidyverse` package is a collection of those packages.
`tidy` data is data where:
* Each variable is a column
* Each observation is a row
* Each value is a cell
Fair enough, you might say, that is a regular spreadsheet. And you are right! However, data comes to us *not* tidy most of the times, and we first need to clean, or `tidy`, it up. Once it's in `tidy` format, we can use the tools in the `tidyverse` with great efficiency to analyse the data and stop worrying about which tool to use.
## Example: Importing Excel Data
The data we will look at is from [Eurostat](http://ec.europa.eu/eurostat/data/database) on demography and migration. You should download the data yourself (click on previous link, then drill down to *database by themes > Population and social conditions > Demograph and migration > Population change - Demographic balance and crude rates at national level (demo_gind)*).
Once downloaded, we can read the data with the function `read_excel` from the package [`readxl`](http://readxl.tidyverse.org), again part of the `tidyverse` suite.
It's important to know how the data is organized in the spreadsheet. Open the file with Excel to see:
* There is a heading which we don't need.
* There are 5 rows with info that we don't need.
* There is one table per variable (total population, males, females, etc)
* Each table has one row for each country, and one column for each year.
* As such, this data is **not tidy**.
Now we will read the first chunk of data, from the first table: *total population*:
```{r}
library(readxl) # load the library
# Suppose I downloaded the file to a directory
# data/ in my current working directory (cwd)
# you can change your cwd with `setwd("your/directory")`
# or in RStudio by clicking Session > Set Working Directory
# total population in raw format
tot_pop_raw = read_excel(path = "data/demo_gind.xls",
sheet="Data", # which sheet
range="A9:K68") # which excel cell range to read
names(tot_pop_raw)[1] <- "Country" # lets rename the first column
tot_pop_raw
```
This shows a `tibble`. This is more informative view of a `data.frame` (it tells us data type, and how many cols and rows). Notice, in particular, that *all* columns seem to be of type `chr`, i.e. characters - a string, not a number! We'll have to fix that, as this is clearly numeric data.
### `tidyr`
In the previous `tibble`, each year is a column name (like `2008`) instead of all years being in one column `year`. We really would like to have several rows for each Country, one per year. We want to `gather()` all years into a new column to tidy this up - and here is how:
1. specify which columns are to be gathered: in our case, all years (note that `paste(2008:2017)` produces a vector like `["2008", "2009", "2010",...]`)
1. say what those columns should be gathered into, i.e. what is the *key* for those values: `year`.
1. Finally, what is the name of the new resulting column, containing the *value* from each cell: `counts`.
```{r gather,warning=FALSE}
library(tidyr) # for the gather function
tot_pop = gather(tot_pop_raw, paste(2008:2017),key="year", value = "counts")
```
Notice that `counts` is still a `chr`. Let's convert it:
```{r convert}
tot_pop$counts = as.integer(tot_pop$counts)
tot_pop
```
Now you can see that column `counts` is indeed a `int`, i.e. an integer number, and we are fine.
### `dplyr`
>The [transform](http://r4ds.had.co.nz/transform.html) chapter of Hadley Wickham's book is a great place to read up more on using `dplyr`.
With `dplyr` you can do the following operations on `data.frame`s and `tibble`s:
* Choose observations based on a certain value: `filter()`
* Reorder rows: `arrange()`
* Select variables by name: `select()`
* Create new variables out of existing ones: `mutate()`
* Summarise variables: `summarise()`
All of those verbs can be used with `group_by()`, where we apply the respective operation on a *group* of the dataframe/tibble. For example, on our `tot_pop` tibble we will now
* filter
* mutate
* and plot the resulting values
Now we can immediately select some countries and plot their population over time. We will make use of the `piping` syntax of `dplyr` as already mentioned in section \@ref(dataframes).
```{r gather-plot,warning=FALSE,message=FALSE}
library(dplyr) # for %>%, filter, mutate, ...
tot_pop %>%
filter(Country %in% c("France","United Kingdom","Italy","Germany")) %>%
mutate(millions = counts / 1e6) %>%
ggplot(mapping = aes(x=year,y=millions,color=Country,group=Country)) + geom_line()
```
#### Arrange a `tibble` {-}
* What are the top/bottom 5 most populated areas?
```{r,message=FALSE}
top5 = tot_pop %>%
arrange(desc(counts)) %>% # arrange in descending order of col `counts`
top_n(5)
bottom5 = tot_pop %>%
arrange(desc(counts)) %>%
top_n(-5)
top5
bottom5
```
Now this is not exactly what we wanted. It's always the same country in both top and bottom, because there are multiple years per country. Let's compute average population over the last 5 years and rank according to that:
```{r,message=FALSE}
topbottom = tot_pop %>%
group_by(Country) %>%
filter(year > 2012) %>%
summarise(mean_count = mean(counts)) %>%
arrange(desc(mean_count))
top5 = topbottom %>% top_n(5)
bottom5 = topbottom %>% top_n(-5)
top5
bottom5
```
That's better!
#### Look for `NA`s in a `tibble` {-}
Sometimes data is *missing*, and `R` represents it with the special value `NA` (not available). It is good to know where in our dataset we are going to encounter any missing values, so the task here is: let's produce a table that has three columns:
1. the names of countries with missing data
2. how many years of data are missing for each of those
3. and the actual years that are missing
```{r}
missings = tot_pop %>%
filter(is.na(counts)) %>% # is.na(x) returns TRUE if x is NA
group_by(Country) %>%
summarise(n_missing = n(),years = paste(year,collapse = ", "))
knitr:::kable(missings)
```
Notice that I used the function `kable` from the `knitr` package to produce a nicely looking table here.
* plot different vars
* make proportions and plot
* make growth rates and plot
* plot different vars by country