jupytext | kernelspec | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
To start off, install the latest version of duckdb
, magic-duckdb
and dash
to run this notebook.
!pip install --upgrade duckdb magic-duckdb dash -q
%load_ext magic_duckdb
We're also going to create a helper variable IN_COLAB
to see if we're running Google Colab. This will come in handy later.
try:
import google.colab
IN_COLAB = True
except:
IN_COLAB = False
This cell downloads the answers for the exercises.
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/answers/answers_4.zip -q
!unzip -o answers_4.zip -d answers
If you haven't already done it, sign up for MotherDuck.
To connect to MotherDuck, all you need to do is connect to a duckdb
database! Your MotherDuck databases will be accessible with the md:
prefix. For example, to connect to the sample_data
database and show the tables, uncomment the following lines and run:
import duckdb
con = duckdb.connect("md:sample_data")
However, this will throw an error! You actually need to specify your authentication token to connect to MotherDuck.
To do so, you can copy your token from Motherduck and add it to your notebook "Secrets".
If you are using Google Colab, you can click on the "Secrets" tab and add a new "token" secret there. See how to do that in the screenshot below.
Now, you can get your token from the secrets manager and load it into an environment variable. After this, you can connect to MotherDuck without any extra authentication steps!
import os
if IN_COLAB:
from google.colab import userdata
os.environ["motherduck_token"] = userdata.get('token')
If you're running in a Jupyter Notebook elsewhere, you can uncomment and run the following, and paste your token in the input field:
# import getpass
# os.environ["motherduck_token"] = getpass.getpass(prompt='Password: ', stream=None)
Create a connection to MotherDuck and show all tables in your `sample_data` database. You can use the `SHOW TABLES` command that is documented [here](https://duckdb.org/docs/guides/meta/list_tables.html).
You are now all ready to go and query your Cloud data warehouse! One example in the sample_data
database is the service_requests
table, which contains New York City 311 Service Requests with requests to the city's complaint service from 2010 to the present.
To query the data, you'll want to fully specify the table name with the following format:
<database name>.<schema>.<table name>
For example, you can run the below cell to get the service requests between March 27th and 31st of 2022:
%%dql -co con
SELECT
created_date, agency_name, complaint_type,
descriptor, incident_address, resolution_description
FROM
sample_data.nyc.service_requests
WHERE
created_date >= '2022-03-27' AND
created_date <= '2022-03-31';
Run `DESCRIBE` on the `sample_data.who.ambient_air_quality` table to inspect the column names. Write a query that gets the average concentrations of PM1.0 and PM2.5 particles for the `'United States of America'`, for the last 10 years, grouped and ordered by year.
# Uncomment and run to show solution
# !cat ./answers/answer_4.02.py
Now, let's try to load some data from a data source into MotherDuck. HuggingFace has recently released an extension for DuckDB, that lets you access and query their entire datasets library!
To query a HuggingFace dataset, you can run:
%%dql -co con
SELECT * FROM read_parquet('hf://datasets/datonic/threatened_animal_species/data/threatened_animal_species.parquet');
Before we create a new table with this data, let's first swap to a different database. You can do so by creating a new DuckDB connection, or by changing the database with the USE
statement. For example, to connect to your default database, my_db
, run:
%%dql -co con
USE my_db;
Create a new table called `animals` in your MotherDuck database `md:my_db` based on the `datonic/threatened_animal_species` dataset.
# Uncomment and run to show solution
# !cat ./answers/answer_4.03.py
DuckDB releases are each named after a duck! Let's load [this data](https://duckdb.org/data/duckdb-releases.csv) into a new table called `duckdb_ducks`. You can use `read_csv` to load the data directly from the HTTP URL: `https://duckdb.org/data/duckdb-releases.csv`.
# Uncomment and run to show solution
# !cat ./answers/answer_4.04.py
Now, we have two tables that we can join together and share with our colleagues!
Let's inspect them and take a look at the columns we have available.
%%dql -co con
DESCRIBE animals;
%%dql -co con
DESCRIBE duckdb_ducks;
Now, we can get the endangered species status of all DuckDB ducks by joining the two.
Create a new table called `duckdb_species` that joins the `duckdb_ducks` and `animals` tables on the scientific name.
# Uncomment and run to show solution
# !cat ./answers/answer_4.05.py
To share your database, you can run:
%%dql -co con -o df
CREATE SHARE duck_share FROM my_db (ACCESS UNRESTRICTED);
Now you can print the share URL:
print(df.share_url.iloc[0])
Check out these datasets from Huggingface: https://huggingface.co/datasets. Pick one, create a share and send it to your neighbor!
To attach a share into your Cloud data warehouse, run:
ATTACH '<share_url>';
For example, to load the Mosaic example datasets, run
%%dql -co con
ATTACH 'md:_share/mosaic_examples/b01cfda8-239e-4148-a228-054b94cdc3b4';
You can then inspect the database and query the data like so:
%%dql -co con
USE mosaic_examples;
SHOW TABLES;
%%dql -co con
SELECT * FROM seattle_weather;
Attach the share you received from your neighbor and inspect the tables.
To detach a database someone shared with you, make sure it's not selected, and run DETACH
:
%%dql -co con
USE my_db;
DETACH mosaic_examples;
To drop the share you created, simply run:
%%dql -co con
DROP SHARE duck_share;
Now that your data is in the Cloud and easy to share, you can also create simple web apps that load and plot the data!
Here is an example Dash app that you can run to plot data in the sample_data
database.
:tags: [remove-output]
from dash import Dash, html, dcc, callback, Output, Input
import plotly.express as px
import pandas as pd
from sqlalchemy import create_engine, text
con = duckdb.connect("md:sample_data")
countries = con.sql('SELECT DISTINCT country_name as countries FROM who.ambient_air_quality ORDER BY country_name')
app = Dash()
app.layout = [
html.H1(children='Air quality by country', style={'textAlign':'center', 'font-family':'monospace'}),
dcc.Dropdown(countries.df().countries.values.tolist(), 'Canada', id='dropdown-selection'),
dcc.Graph(id='graph-content')
]
@callback(
Output('graph-content', 'figure'),
Input('dropdown-selection', 'value')
)
def update_graph(value):
sql = "SELECT year, avg(pm25_concentration) as avg_pm25 FROM who.ambient_air_quality WHERE country_name=? GROUP by year ORDER by year"
result = con.execute(sql, [value]).df()
return px.line(result, x='year', y='avg_pm25')
if __name__ == '__main__':
app.run(debug=True)
Create a dashboard that plots the data your neighbor shared with you. Share it in the `#scipy-2024` Slack channel so others can give it a try!