22-sql_essentials.qmd

---
engine: knitr
---

# SQL essentials {#sec-sql}

**Prerequisites**

- Read *Early History of SQL*, [@Chamberlin2012]
  - An interesting overview of the development of SQL.

**Key concepts and skills**

- While we could use various R package, or write SQL within RStudio, the industry demand for SQL, makes it worthwhile learning independently of R, at least initially.
- SQLite is one flavor of SQL, and we can use DB Browser for SQLite as an IDE for SQL.

**Key packages and functions**

- `BETWEEN`
- `DESC`
- `DISTINCT`
- `FROM`
- `GROUP BY`
- `LEFT JOIN`
- `LIKE`
- `LIMIT`
- `OR`
- `ORDER BY`
- `SELECT`
- `UPDATE`
- `WHERE`

## Introduction

Structured Query Language (SQL) ("see-quell" or "S.Q.L.") is used with relational databases. A relational database is a collection of at least one table, and a table is just some data organized into rows and columns. If there is more than one table in the database, then there should be some column that links them. An example is the `AustralianPoliticians` datasets that are used in @sec-r-essentials. Using SQL feels a bit like HTML/CSS in terms of being halfway between markup and programming. One fun aspect is that, by convention, commands are written in upper case. Another is that line spaces mean nothing: include them or do not, but always end a SQL command in a semicolon;

SQL was developed in the 1970s at IBM. SQL is an especially popular way of working with data. There are many "flavors" of SQL, including both closed and open options. Here we introduce SQLite, which is open source, and pre-installed on Macs. Windows users can install it from [here](https://www.sqlite.org/download.html).

Advanced SQL users do a lot with it alone, but even just having a working knowledge of SQL increases the number of datasets that we can access. A working knowledge of SQL is especially useful for our efficiency because a large number of datasets are stored on SQL servers, and being able to get data from them ourselves is handy. 

We could use SQL within RStudio, especially drawing on `DBI` [@dbi]. Although given the demand for SQL skills, independent of demand for R skills, it may be a better idea, from a career perspective to have a working knowledge of it that is independent of RStudio. We can consider many SQL commands as straightforward variants of the `dplyr` verbs that we have used throughout this book. Indeed, if we wanted to stay within R, then `dbplyr` [@dbplyr] would explicitly allow us to use `dplyr` functions and would then automatically translate them into SQL. Having used `mutate()`, `filter()`, and `left_join()` in the `tidyverse` means that many of the core SQL commands will be familiar. That means that the main difficulty will be getting on top of the order of operations because SQL can be pedantic.

## Getting started

To get started with SQL, download [DB Browser for SQLite](https://sqlitebrowser.org) (DB4S), which is free and open source, and open it (@fig-sql). 

![Opening DB Browser for SQLite](figures/sql_01.png){#fig-sql width=70% fig-align="center"}

Download "AustralianPoliticians.db" [here](https://github.com/RohanAlexander/telling_stories/blob/main/inputs/AustralianPoliticians.db) and then open it with "Open Database" and navigate to where you downloaded the database.

There are three key SQL commands that we now cover: `SELECT`, `FROM`, and `WHERE`. `SELECT` allows us to specify particular columns of the data, and we can consider `SELECT` in a similar way to `select()`. In the same way that we need to specify a dataset with `select()` and did that using a pipe operator, we specify a dataset with `FROM`. For instance, we could open "Execute SQL", and then type the following, and click "Execute".

```{sql}
#| include: true
#| eval: false

SELECT
    surname   
FROM
    politicians;
```

The result is that we obtain the column of surnames. We could select multiple columns by separating them with commas, or all of them by using an asterisk, although this is not best practice because if the dataset were to change without us knowing then our result would differ.

```{sql}
#| include: true
#| eval: false

SELECT
    uniqueID,
    surname   
FROM
    politicians;
```

```{sql}
#| include: true
#| eval: false

SELECT
    *
FROM
    politicians;
```

And, finally, if there were repeated rows, then we could just look at the unique ones using `DISTINCT`, in a similar way to `distinct()`.

```{sql}
#| include: true
#| eval: false

SELECT
    DISTINCT surname   
FROM
    politicians;
```

So far we have used `SELECT` along with `FROM`. The third command that is commonly used is `WHERE`, and this will allow us to focus on particular rows, in a similar way to `filter()`.

```{sql}
#| include: true
#| eval: false

SELECT
    uniqueID,
    surname,
    firstName   
FROM
    politicians     
WHERE
    firstName = "Myles";
```

All the usual logical operators are fine with `WHERE`, such as "=", "!=", ">", "<", ">=", and "<=". We could combine conditions using `AND` and `OR`.

```{sql}
#| include: true
#| eval: false

SELECT
    uniqueID,
    surname,
    firstName   
FROM
    politicians     
WHERE
    firstName = "Myles" 
    OR firstName = "Ruth";
```

If we have a query that gave a lot of results, then we could limit the number of them with `LIMIT`. 

```{sql}
#| include: true
#| eval: false

SELECT
    uniqueID,
    surname,
    firstName        
FROM
    politicians          
WHERE
    firstName = "Robert"       LIMIT 5;
```

And we could specify the order of the results with `ORDER`.

```{sql}
#| include: true
#| eval: false

SELECT
    uniqueID,
    surname,
    firstName 
FROM
    politicians 
WHERE
    firstName = "Robert" 
ORDER BY
    surname DESC;
```

See the rows that are pretty close to a criteria:

```{sql}
#| include: true
#| eval: false

SELECT
    uniqueID,
    surname,
    firstName      
FROM
    politicians      
WHERE
    firstName LIKE "Ma__";
```

The "_" above is a wildcard that matches to any character. This provides results that include "Mary" and "Mark". `LIKE` is not case-sensitive: "Ma\_\_" and "ma\_\_" both return the same results.

Focusing on missing data is possible using "NULL" or "NOT NULL". 

```{sql}
#| include: true
#| eval: false

SELECT
    uniqueID,
    surname,
    firstName,
    comment
FROM
    politicians      
WHERE
    comment     IS NULL;
```

An ordering is applied to number, date, and text fields that means we can use `BETWEEN` on all those, not just numeric. For instance, we could look for all surnames that start with a letter between X and Z (not including Z).

```{sql}
#| include: true
#| eval: false

SELECT
    uniqueID,
    surname,
    firstName
FROM
    politicians      
WHERE
    surname     BETWEEN "X" AND "Z";
```

Using `WHERE` with a numeric variable means that `BETWEEN` is inclusive, compared with the example with letters which is not.

```{sql}
#| include: true
#| eval: false

SELECT
    uniqueID,
    surname,
    firstName,
    birthYear
FROM
    politicians      
WHERE
    birthYear     BETWEEN 1980 AND 1990;
```

In addition to providing us with dataset observations that match what we asked for, we can modify the dataset. For instance, we could edit a value using `UPDATE` and `SET`.

```{sql}
#| include: true
#| eval: false

UPDATE
    politicians   
SET
    displayName = "John Gilbert Alexander"
WHERE
    uniqueID = "Alexander1951";
```

We can integrate if-else logic with `CASE` and `ELSE`. For instance, we add a column called "wasTreasurer", which is "Yes" in the case of "Josh Frydenberg", and "No" in the case of "Kevin Rudd", and "Unsure" for all other cases.

```{sql}
#| include: true
#| eval: false

SELECT
    uniqueID,
    surname,
    firstName,
    birthYear,
    CASE                  
        WHEN uniqueID = "Frydenberg1971" THEN "Yes"                  
        WHEN surname = "Rudd" THEN "No"                  
        ELSE "Unsure"            
    END AS "wasTreasurer"        
FROM
    politicians;
```

We can create summary statistics using commands such as `COUNT`, `SUM`, `MAX`, `MIN`, `AVG`, and `ROUND` in the place of `summarize()`. `COUNT` counts the number of rows that are not empty for some column by passing the column name, and this is similarly how `MIN`, etc, work.

```{sql}
#| include: true
#| eval: false

SELECT
    COUNT(uniqueID)   
FROM
    politicians;
```

```{sql}
#| include: true
#| eval: false

SELECT
    MIN(birthYear)   
FROM
    politicians;
```

We can get results based on different groups in our dataset using `GROUP BY`, in a similar manner to `group_by` in R.

```{sql}
#| include: true
#| eval: false

SELECT
    COUNT(uniqueID)   
FROM
    politicians     
GROUP BY
    gender;
```

And finally, we can combine two tables using `LEFT JOIN`. We need to be careful to specify the matching columns using dot notation. 

```{sql}
#| include: true
#| eval: false

SELECT
    politicians.uniqueID,
    politicians.firstName,
    politicians.surname,
    party.partySimplifiedName   
FROM
    politicians 
LEFT JOIN
    party     
        ON politicians.uniqueID = party.uniqueID;
```

As SQL is not our focus we have only provided a brief overview of some essential commands. From a career perspective you should develop a comfort with SQL. It is so integrated into data science that it would be "difficult to get too far without it" [@robinsonnolis2020, p. 8] and that "almost any" data science interview will include questions about SQL [@robinsonnolis2020, p. 110].


## Exercises

### Practice {.unnumbered}

Please submit a screenshot showing you got at least 70 per cent in the free w3school [SQL Quiz](https://www.w3schools.com/quiztest/quiztest.asp?qtest=SQL). You may like to go through their [tutorial](https://www.w3schools.com/sql/default.asp), but the SQL content in this chapter (combined with your `dplyr` experience) is sufficient to get 70 per cent. Please include the time and date in the screenshot i.e. take a screenshot of your whole screen, not just the browser.

### Quiz {.unnumbered}

1. What is SQL primarily used for?
    a. Styling web pages
    b.  Managing and querying relational databases
    c. Creating machine learning models
    d. Designing graphic interfaces
2. In SQL, which command is used to retrieve specific columns from a table?
    a. JOIN
    b. WHERE
    c.  SELECT
    d. FROM
3. Which clause in SQL is used to specify the table to query data from?
    a. WHERE
    b.  FROM
    c. GROUP BY
    d. SELECT
4. What does the SQL command WHERE do?
    a. Joins two tables together
    b. Sorts the data
    c. Groups records with identical data
    d.  Filters records based on specified conditions
5. How do you select all columns from a table named employees?
    a. SELECT # FROM employees;
    b.  SELECT * FROM employees;
    c. SELECT all FROM employees;
    d. SELECT columns FROM employees;
6. Which SQL keyword is used to eliminate duplicate rows in the result set?
    a. UNIQUE
    b.  DISTINCT
    c. REMOVE
    d. DELETE
7. What is the purpose of the LIMIT clause in an SQL query?
    a. To set a maximum value for a column
    b.  To restrict the number of rows returned
    c. To restrict the number of columns displayed
    d. To enforce access control
8. Which SQL clause is used to sort the result set of a query?
    a. SORT
    b.  ORDER BY
    c. SORT BY
    d. ORDER
9. In SQL, what does the wildcard character _ represent when used with the LIKE operator?
    a. Any numeric digit
    b. A space character
    c.  Any single character
    d. Zero or more characters
10. How would you select records where the firstName starts with 'Ma' and is followed by any two characters?
    a. WHERE firstName LIKE 'Ma*';
    b.  WHERE firstName LIKE 'Ma__';
    c. WHERE firstName LIKE 'Ma??';
    d. WHERE firstName LIKE 'Ma%';
11. Which SQL statement is used to update data in a database table?
    a. SET
    b. CHANGE
    c.  UPDATE
    d. MODIFY
12. How do you check for NULL values in an SQL query?
    a. WHERE column LIKE NULL
    b. WHERE column EQUALS NULL
    c. WHERE column = NULL
    d.  WHERE column IS NULL
13. What does the BETWEEN operator do in SQL?
    a.  Selects values within a given range
    b. Sorts the result set
    c. Checks if a value is null
    d. Combines multiple conditions
14. Which SQL keyword is used to group rows that share a property so that an aggregate function can be applied to each group?
    a.  GROUP BY
    b. HAVING
    c. DISTINCT
    d. ORDER BY
15. What is the purpose of the LEFT JOIN in SQL?
    a. Combines all rows from two tables
    b. Returns all records from the right table and matched records from the left table
    c.  Returns all records from the left table and matched records from the right table
    d. Returns rows that have matching values in both tables
16. Why is it not best practice to use SELECT * in SQL queries?
    a. It is slower than specifying columns
    b. It is not supported in all SQL flavor 
    c. It doesn't return any data
    d.  It can lead to unexpected results if the database schema changes

### Task {.unnumbered}

Get the SQL dataset from [here](https://jacobfilipp.com/hammer/).

Use SQL (not R or Python) to make some finding using this observational data. Write a short paper using Quarto (you are welcome to use R/Python to make graphs but not for data preparation/manipulation which should occur in SQL in a separate script). In the discussion please have one sub-section each on: 1) correlation vs. causation; 2) missing data; 3) sources of bias.

Submit a link to a GitHub repo (one repo per group) that meets the general expectations. 

Components of the rubric that are relevant are: "R/Python is cited", "Data are cited", "Class paper", "LLM usage is documented", "Title", "Author, date, and repo", "Abstract", "Introduction", "Data", "Measurement", "Results", "Discussion", "Prose", "Cross-references", "Captions", "Graphs/tables/etc", "Referencing", "Commits", "Sketches", "Simulation", "Tests", and "Reproducible workflow".