-
Notifications
You must be signed in to change notification settings - Fork 77
/
22-sql_essentials.qmd
422 lines (340 loc) · 13.7 KB
/
22-sql_essentials.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
---
engine: knitr
---
# SQL essentials {#sec-sql}
**Prerequisites**
- Read *Early History of SQL*, [@Chamberlin2012]
- An interesting overview of the development of SQL.
**Key concepts and skills**
- While we could use various R package, or write SQL within RStudio, the industry demand for SQL, makes it worthwhile learning independently of R, at least initially.
- SQLite is one flavor of SQL, and we can use DB Browser for SQLite as an IDE for SQL.
**Key packages and functions**
- `BETWEEN`
- `DESC`
- `DISTINCT`
- `FROM`
- `GROUP BY`
- `LEFT JOIN`
- `LIKE`
- `LIMIT`
- `OR`
- `ORDER BY`
- `SELECT`
- `UPDATE`
- `WHERE`
## Introduction
Structured Query Language (SQL) ("see-quell" or "S.Q.L.") is used with relational databases. A relational database is a collection of at least one table, and a table is just some data organized into rows and columns. If there is more than one table in the database, then there should be some column that links them. An example is the `AustralianPoliticians` datasets that are used in @sec-r-essentials. Using SQL feels a bit like HTML/CSS in terms of being halfway between markup and programming. One fun aspect is that, by convention, commands are written in upper case. Another is that line spaces mean nothing: include them or do not, but always end a SQL command in a semicolon;
SQL was developed in the 1970s at IBM. SQL is an especially popular way of working with data. There are many "flavors" of SQL, including both closed and open options. Here we introduce SQLite, which is open source, and pre-installed on Macs. Windows users can install it from [here](https://www.sqlite.org/download.html).
Advanced SQL users do a lot with it alone, but even just having a working knowledge of SQL increases the number of datasets that we can access. A working knowledge of SQL is especially useful for our efficiency because a large number of datasets are stored on SQL servers, and being able to get data from them ourselves is handy.
We could use SQL within RStudio, especially drawing on `DBI` [@dbi]. Although given the demand for SQL skills, independent of demand for R skills, it may be a better idea, from a career perspective to have a working knowledge of it that is independent of RStudio. We can consider many SQL commands as straightforward variants of the `dplyr` verbs that we have used throughout this book. Indeed, if we wanted to stay within R, then `dbplyr` [@dbplyr] would explicitly allow us to use `dplyr` functions and would then automatically translate them into SQL. Having used `mutate()`, `filter()`, and `left_join()` in the `tidyverse` means that many of the core SQL commands will be familiar. That means that the main difficulty will be getting on top of the order of operations because SQL can be pedantic.
## Getting started
To get started with SQL, download [DB Browser for SQLite](https://sqlitebrowser.org) (DB4S), which is free and open source, and open it (@fig-sql).
![Opening DB Browser for SQLite](figures/sql_01.png){#fig-sql width=70% fig-align="center"}
Download "AustralianPoliticians.db" [here](https://github.com/RohanAlexander/telling_stories/blob/main/inputs/AustralianPoliticians.db) and then open it with "Open Database" and navigate to where you downloaded the database.
There are three key SQL commands that we now cover: `SELECT`, `FROM`, and `WHERE`. `SELECT` allows us to specify particular columns of the data, and we can consider `SELECT` in a similar way to `select()`. In the same way that we need to specify a dataset with `select()` and did that using a pipe operator, we specify a dataset with `FROM`. For instance, we could open "Execute SQL", and then type the following, and click "Execute".
```{sql}
#| include: true
#| eval: false
SELECT
surname
FROM
politicians;
```
The result is that we obtain the column of surnames. We could select multiple columns by separating them with commas, or all of them by using an asterisk, although this is not best practice because if the dataset were to change without us knowing then our result would differ.
```{sql}
#| include: true
#| eval: false
SELECT
uniqueID,
surname
FROM
politicians;
```
```{sql}
#| include: true
#| eval: false
SELECT
*
FROM
politicians;
```
And, finally, if there were repeated rows, then we could just look at the unique ones using `DISTINCT`, in a similar way to `distinct()`.
```{sql}
#| include: true
#| eval: false
SELECT
DISTINCT surname
FROM
politicians;
```
So far we have used `SELECT` along with `FROM`. The third command that is commonly used is `WHERE`, and this will allow us to focus on particular rows, in a similar way to `filter()`.
```{sql}
#| include: true
#| eval: false
SELECT
uniqueID,
surname,
firstName
FROM
politicians
WHERE
firstName = "Myles";
```
All the usual logical operators are fine with `WHERE`, such as "=", "!=", ">", "<", ">=", and "<=". We could combine conditions using `AND` and `OR`.
```{sql}
#| include: true
#| eval: false
SELECT
uniqueID,
surname,
firstName
FROM
politicians
WHERE
firstName = "Myles"
OR firstName = "Ruth";
```
If we have a query that gave a lot of results, then we could limit the number of them with `LIMIT`.
```{sql}
#| include: true
#| eval: false
SELECT
uniqueID,
surname,
firstName
FROM
politicians
WHERE
firstName = "Robert" LIMIT 5;
```
And we could specify the order of the results with `ORDER`.
```{sql}
#| include: true
#| eval: false
SELECT
uniqueID,
surname,
firstName
FROM
politicians
WHERE
firstName = "Robert"
ORDER BY
surname DESC;
```
See the rows that are pretty close to a criteria:
```{sql}
#| include: true
#| eval: false
SELECT
uniqueID,
surname,
firstName
FROM
politicians
WHERE
firstName LIKE "Ma__";
```
The "_" above is a wildcard that matches to any character. This provides results that include "Mary" and "Mark". `LIKE` is not case-sensitive: "Ma\_\_" and "ma\_\_" both return the same results.
Focusing on missing data is possible using "NULL" or "NOT NULL".
```{sql}
#| include: true
#| eval: false
SELECT
uniqueID,
surname,
firstName,
comment
FROM
politicians
WHERE
comment IS NULL;
```
An ordering is applied to number, date, and text fields that means we can use `BETWEEN` on all those, not just numeric. For instance, we could look for all surnames that start with a letter between X and Z (not including Z).
```{sql}
#| include: true
#| eval: false
SELECT
uniqueID,
surname,
firstName
FROM
politicians
WHERE
surname BETWEEN "X" AND "Z";
```
Using `WHERE` with a numeric variable means that `BETWEEN` is inclusive, compared with the example with letters which is not.
```{sql}
#| include: true
#| eval: false
SELECT
uniqueID,
surname,
firstName,
birthYear
FROM
politicians
WHERE
birthYear BETWEEN 1980 AND 1990;
```
In addition to providing us with dataset observations that match what we asked for, we can modify the dataset. For instance, we could edit a value using `UPDATE` and `SET`.
```{sql}
#| include: true
#| eval: false
UPDATE
politicians
SET
displayName = "John Gilbert Alexander"
WHERE
uniqueID = "Alexander1951";
```
We can integrate if-else logic with `CASE` and `ELSE`. For instance, we add a column called "wasTreasurer", which is "Yes" in the case of "Josh Frydenberg", and "No" in the case of "Kevin Rudd", and "Unsure" for all other cases.
```{sql}
#| include: true
#| eval: false
SELECT
uniqueID,
surname,
firstName,
birthYear,
CASE
WHEN uniqueID = "Frydenberg1971" THEN "Yes"
WHEN surname = "Rudd" THEN "No"
ELSE "Unsure"
END AS "wasTreasurer"
FROM
politicians;
```
We can create summary statistics using commands such as `COUNT`, `SUM`, `MAX`, `MIN`, `AVG`, and `ROUND` in the place of `summarize()`. `COUNT` counts the number of rows that are not empty for some column by passing the column name, and this is similarly how `MIN`, etc, work.
```{sql}
#| include: true
#| eval: false
SELECT
COUNT(uniqueID)
FROM
politicians;
```
```{sql}
#| include: true
#| eval: false
SELECT
MIN(birthYear)
FROM
politicians;
```
We can get results based on different groups in our dataset using `GROUP BY`, in a similar manner to `group_by` in R.
```{sql}
#| include: true
#| eval: false
SELECT
COUNT(uniqueID)
FROM
politicians
GROUP BY
gender;
```
And finally, we can combine two tables using `LEFT JOIN`. We need to be careful to specify the matching columns using dot notation.
```{sql}
#| include: true
#| eval: false
SELECT
politicians.uniqueID,
politicians.firstName,
politicians.surname,
party.partySimplifiedName
FROM
politicians
LEFT JOIN
party
ON politicians.uniqueID = party.uniqueID;
```
As SQL is not our focus we have only provided a brief overview of some essential commands. From a career perspective you should develop a comfort with SQL. It is so integrated into data science that it would be "difficult to get too far without it" [@robinsonnolis2020, p. 8] and that "almost any" data science interview will include questions about SQL [@robinsonnolis2020, p. 110].
## Exercises
### Practice {.unnumbered}
Please submit a screenshot showing you got at least 70 per cent in the free w3school [SQL Quiz](https://www.w3schools.com/quiztest/quiztest.asp?qtest=SQL). You may like to go through their [tutorial](https://www.w3schools.com/sql/default.asp), but the SQL content in this chapter (combined with your `dplyr` experience) is sufficient to get 70 per cent. Please include the time and date in the screenshot i.e. take a screenshot of your whole screen, not just the browser.
### Quiz {.unnumbered}
1. What is SQL primarily used for?
a. Styling web pages
b. Managing and querying relational databases
c. Creating machine learning models
d. Designing graphic interfaces
2. In SQL, which command is used to retrieve specific columns from a table?
a. JOIN
b. WHERE
c. SELECT
d. FROM
3. Which clause in SQL is used to specify the table to query data from?
a. WHERE
b. FROM
c. GROUP BY
d. SELECT
4. What does the SQL command WHERE do?
a. Joins two tables together
b. Sorts the data
c. Groups records with identical data
d. Filters records based on specified conditions
5. How do you select all columns from a table named employees?
a. SELECT # FROM employees;
b. SELECT * FROM employees;
c. SELECT all FROM employees;
d. SELECT columns FROM employees;
6. Which SQL keyword is used to eliminate duplicate rows in the result set?
a. UNIQUE
b. DISTINCT
c. REMOVE
d. DELETE
7. What is the purpose of the LIMIT clause in an SQL query?
a. To set a maximum value for a column
b. To restrict the number of rows returned
c. To restrict the number of columns displayed
d. To enforce access control
8. Which SQL clause is used to sort the result set of a query?
a. SORT
b. ORDER BY
c. SORT BY
d. ORDER
9. In SQL, what does the wildcard character _ represent when used with the LIKE operator?
a. Any numeric digit
b. A space character
c. Any single character
d. Zero or more characters
10. How would you select records where the firstName starts with 'Ma' and is followed by any two characters?
a. WHERE firstName LIKE 'Ma*';
b. WHERE firstName LIKE 'Ma__';
c. WHERE firstName LIKE 'Ma??';
d. WHERE firstName LIKE 'Ma%';
11. Which SQL statement is used to update data in a database table?
a. SET
b. CHANGE
c. UPDATE
d. MODIFY
12. How do you check for NULL values in an SQL query?
a. WHERE column LIKE NULL
b. WHERE column EQUALS NULL
c. WHERE column = NULL
d. WHERE column IS NULL
13. What does the BETWEEN operator do in SQL?
a. Selects values within a given range
b. Sorts the result set
c. Checks if a value is null
d. Combines multiple conditions
14. Which SQL keyword is used to group rows that share a property so that an aggregate function can be applied to each group?
a. GROUP BY
b. HAVING
c. DISTINCT
d. ORDER BY
15. What is the purpose of the LEFT JOIN in SQL?
a. Combines all rows from two tables
b. Returns all records from the right table and matched records from the left table
c. Returns all records from the left table and matched records from the right table
d. Returns rows that have matching values in both tables
16. Why is it not best practice to use SELECT * in SQL queries?
a. It is slower than specifying columns
b. It is not supported in all SQL flavor
c. It doesn't return any data
d. It can lead to unexpected results if the database schema changes
### Task {.unnumbered}
Get the SQL dataset from [here](https://jacobfilipp.com/hammer/).
Use SQL (not R or Python) to make some finding using this observational data. Write a short paper using Quarto (you are welcome to use R/Python to make graphs but not for data preparation/manipulation which should occur in SQL in a separate script). In the discussion please have one sub-section each on: 1) correlation vs. causation; 2) missing data; 3) sources of bias.
Submit a link to a GitHub repo (one repo per group) that meets the general expectations.
Components of the rubric that are relevant are: "R/Python is cited", "Data are cited", "Class paper", "LLM usage is documented", "Title", "Author, date, and repo", "Abstract", "Introduction", "Data", "Measurement", "Results", "Discussion", "Prose", "Cross-references", "Captions", "Graphs/tables/etc", "Referencing", "Commits", "Sketches", "Simulation", "Tests", and "Reproducible workflow".