Add cudf snowflake example (#505)

rapidsai · Feb 10, 2025 · fbe0265 · fbe0265
1 parent 27a82cb
commit fbe0265
Show file tree

Hide file tree

Showing 2 changed files with 356 additions and 0 deletions.
diff --git a/source/examples/index.md b/source/examples/index.md
@@ -20,4 +20,5 @@ xgboost-rf-gpu-cpu-benchmark/notebook
 xgboost-dask-databricks/notebook
 xgboost-azure-mnmg-daskcloudprovider/notebook
 rapids-1brc-single-node/notebook
+rapids-snowflake-cudf/notebook
 ```
diff --git a/source/examples/rapids-snowflake-cudf/notebook.ipynb b/source/examples/rapids-snowflake-cudf/notebook.ipynb
@@ -0,0 +1,355 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "fa159e8e-74d9-43d1-9238-14260c389960",
+   "metadata": {
+    "editable": true,
+    "slideshow": {
+     "slide_type": ""
+    },
+    "tags": [
+     "library/cudf",
+     "platform/snowflake"
+    ]
+   },
+   "source": [
+    "# Getting Started with cudf.pandas and Snowflake\n",
+    "\n",
+    "## RAPIDS in Snowflake\n",
+    "\n",
+    "\n",
+    "[RAPIDS](https://rapids.ai/) is a suite of libraries to execute end-to-end data science pipelines entirely on GPUs. If you have data in a [Snowflake](https://www.snowflake.com/) table that you want to explore with the RAPIDS, you can deploy RAPIDS in Snowflake using Snowpark Container Services.\n",
+    "\n",
+    "```{docref} /platforms/snowflake\n",
+    "For the purpose of this example, follow the Run RAPIDS on Snowflake guide, befeore getting started.\n",
+    "```\n",
+    "\n",
+    "## NYC Parking Tickets `cudf.pandas` Example\n",
+    "\n",
+    "If you have data in a Snowflake table, you can accelerate your ETL workflow with `cuDF.pandas`. With `cudf.pandas` you can accelerate the `pandas` ecosystem, with zero code changes. Just load `cudf.pandas` and you will have the benefits of GPU acceleration, with automatic CPU fallback if needed. \n",
+    "\n",
+    "For this example, we have a Snowflake table with the [Parking Violations Issued - Fiscal Year 2022](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/7mxj-7a6y/about_data) dataset from NYC Open Data. \n",
+    "\n",
+    "\n",
+    "### Get data into a Snowflake table\n",
+    "\n",
+    "To follow along, you will need to have the NYC Parking Violations data into your snowflake account, and make sure that this data is accessible from the RAPIDS notebook Snowpark Service Container that you deployed following the [Run RAPIDS on Snowflake](../../platforms/snowflake) guide.\n",
+    "\n",
+    "In a Snowflake SQL sheet and with `ACCOUNTADMIN` role\n",
+    "\n",
+    "```sql\n",
+    "-- Create a database where the table would live --\n",
+    "\n",
+    "CREATE DATABASE CUDF_SNOWFLAKE_EXAMPLE;\n",
+    "\n",
+    "USE DATABASE DATABASE CUDF_SNOWFLAKE_EXAMPLE; \n",
+    "\n",
+    "CREATE OR REPLACE FILE FORMAT my_parquet_format\n",
+    "TYPE = 'PARQUET';\n",
+    "\n",
+    "CREATE OR REPLACE STAGE my_s3_stage\n",
+    "URL = 's3://rapidsai-data/datasets/nyc_parking/'\n",
+    "FILE_FORMAT = my_parquet_format;\n",
+    "\n",
+    "-- Infer schema from parquet file to use when creating table later --\n",
+    "SELECT COLUMN_NAME, TYPE\n",
+    "FROM TABLE(\n",
+    "    INFER_SCHEMA(\n",
+    "        LOCATION => '@my_s3_stage',\n",
+    "        FILE_FORMAT => 'my_parquet_format',\n",
+    "        FILES => ('nyc_parking_violations_2022.parquet')\n",
+    "    )\n",
+    ");\n",
+    "\n",
+    "-- Create table using the inferred schema in the previous step --\n",
+    "CREATE OR REPLACE TABLE NYC_PARKING_VIOLATIONS\n",
+    "  USING TEMPLATE (\n",
+    "    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))\n",
+    "      FROM TABLE(\n",
+    "        INFER_SCHEMA(\n",
+    "        LOCATION => '@my_s3_stage',\n",
+    "        FILE_FORMAT => 'my_parquet_format',\n",
+    "        FILES => ('nyc_parking_violations_2022.parquet')\n",
+    "        )\n",
+    "      ));\n",
+    "\n",
+    "-- Get data from the stage into the table --      \n",
+    "COPY INTO NYC_PARKING_VIOLATIONS\n",
+    "FROM @my_s3_stage\n",
+    "FILES = ('nyc_parking_violations_2022.parquet')\n",
+    "FILE_FORMAT = (TYPE = 'PARQUET')\n",
+    "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;\n",
+    "```\n",
+    "\n",
+    "### Ensure access from container\n",
+    "\n",
+    "During the process of deploying RAPIDS in Snowflake, you created a `CONTAINER_USER_ROLE` and we need to make sure this role has access to the database, schema and table where the data is, to be able to query from it. \n",
+    "\n",
+    "```sql\n",
+    "-- Ensure the role has USAGE permissions on the database and schema\n",
+    "GRANT USAGE ON DATABASE CUDF_SNOWFLAKE_EXAMPLE TO ROLE CONTAINER_USER_ROLE;\n",
+    "GRANT USAGE ON SCHEMA CUDF_SNOWFLAKE_EXAMPLE.PUBLIC TO ROLE CONTAINER_USER_ROLE; \n",
+    "\n",
+    "-- Ensure the role has SELECT permission on the table\n",
+    "GRANT SELECT ON TABLE CUDF_SNOWFLAKE_EXAMPLE.PUBLIC.NYC_PARKING_VIOLATIONS TO ROLE CONTAINER_USER_ROLE;\n",
+    "```\n",
+    "\n",
+    "### Read data and play around. \n",
+    "\n",
+    "Now that you have the data in a Snowflake table, and the RAPIDS Snowpark container up and running, create a new notebook in the `workspace` directory (anything that is added to this directory will persist), and follow the instructions below.\n",
+    "\n",
+    "![](../../images/snowflake_jupyter.png)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d257ec35-1982-46f5-a134-af75300871dd",
+   "metadata": {},
+   "source": [
+    "### Load cudf.pandas\n",
+    "In the first cell of your notebook, load the `cudf.pandas` extension"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "054eaa32-6ebc-42ad-9946-ee91e3a2079c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%load_ext cudf.pandas"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6bf7f25d-d05b-4905-9b16-a08081dc4638",
+   "metadata": {},
+   "source": [
+    "### Connect to Snowflake and create a Snowpark session "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "116536da-b579-445d-ada7-d95e1cc232aa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from pathlib import Path\n",
+    "\n",
+    "from snowflake.snowpark import Session\n",
+    "\n",
+    "connection_parameters = {\n",
+    "    \"account\": os.getenv(\"SNOWFLAKE_ACCOUNT\"),\n",
+    "    \"host\": os.getenv(\"SNOWFLAKE_HOST\"),\n",
+    "    \"token\": Path(\"/snowflake/session/token\").read_text(),\n",
+    "    \"authenticator\": \"oauth\",\n",
+    "    \"database\": \"CUDF_SNOWFLAKE_EXAMPLE\",  # the created database\n",
+    "    \"schema\": \"PUBLIC\",\n",
+    "    \"warehouse\": \"CONTAINER_HOL_WH\",\n",
+    "}\n",
+    "\n",
+    "session = Session.builder.configs(connection_parameters).create()\n",
+    "\n",
+    "# Check the session\n",
+    "print(\n",
+    "    f\"Current session info: Warehouse: {session.get_current_warehouse()}  \"\n",
+    "    f\"Database: {session.get_current_database()}    \"\n",
+    "    f\"Schema: {session.get_current_schema()}  \"\n",
+    "    f\"Role: {session.get_current_role()}\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d643cbdf-8b97-4278-acba-3ebaf404ea0b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Get some interesting columns from the table\n",
+    "table = session.table(\"NYC_PARKING_VIOLATIONS\").select(\n",
+    "    \"Registration State\",\n",
+    "    \"Violation Description\",\n",
+    "    \"Vehicle Body Type\",\n",
+    "    \"Issue Date\",\n",
+    "    \"Summons Number\",\n",
+    ")\n",
+    "table"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "617b449a-a1ea-4484-8f69-61682a4d5236",
+   "metadata": {},
+   "source": [
+    "Notice that up to this point, we have a snowpark dataframe. To get a pandas dataframe we use `.to_pandas()`\n",
+    "\n",
+    "```{warning}\n",
+    "At the moment, there is a known issue that is preventing us to accelerate the following step with cudf, and we hope to solve this issue soon. In the meantime we need to do a workaround to get the data into a pandas dataframe that `cudf.pandas` can understand.\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "55a225d7-8fde-4a72-b580-d65cfbc44906",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from cudf.pandas.module_accelerator import disable_module_accelerator\n",
+    "\n",
+    "with disable_module_accelerator():\n",
+    "    df = table.to_pandas()\n",
+    "\n",
+    "import pandas as pd\n",
+    "\n",
+    "df = pd.DataFrame(df)  # this will take a few seconds"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ca6c922c-b20d-4bae-85ef-7d78634b886c",
+   "metadata": {},
+   "source": [
+    "In the future the cell above will reduce to simple doing `df = table.to_pandas()`. \n",
+    "\n",
+    "But now we are ready to get see `cudf.pandas` in action. For the record, this dataset has `len(df) = 15435607` and you should see the following operations take in the order of milliseconds to run."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "873fb777-f888-499f-b502-e33d7da82238",
+   "metadata": {},
+   "source": [
+    "**Which parking violation is most commonly committed by vehicles from various U.S states?**\n",
+    "\n",
+    "Each record in our dataset contains the state of registration of the offending vehicle, and the type of parking offence. Let's say we want to get the most common type of offence for vehicles registered in different states. We can do: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4acbed7b-2386-4646-a3f6-1dd5e45e5646",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "(\n",
+    "    df[[\"Registration State\", \"Violation Description\"]]  # get only these two columns\n",
+    "    .value_counts()  # get the count of offences per state and per type of offence\n",
+    "    .groupby(\"Registration State\")  # group by state\n",
+    "    .head(\n",
+    "        1\n",
+    "    )  # get the first row in each group (the type of offence with the largest count)\n",
+    "    .sort_index()  # sort by state name\n",
+    "    .reset_index()\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4d2f86b5-ff79-466f-99ed-3fa23daabf77",
+   "metadata": {},
+   "source": [
+    "**Which vehicle body types are most frequently involved in parking violations?**\n",
+    "We can also investigate which vehicle body types most commonly appear in parking violations"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "25082d0a-819f-49e2-a4de-663ce0d106f1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "\n",
+    "(\n",
+    "    df.groupby([\"Vehicle Body Type\"])\n",
+    "    .agg({\"Summons Number\": \"count\"})\n",
+    "    .rename(columns={\"Summons Number\": \"Count\"})\n",
+    "    .sort_values([\"Count\"], ascending=False)\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fd690cca-b20c-480c-b9aa-276484ec1742",
+   "metadata": {},
+   "source": [
+    "**How do parking violations vary across days of the week?**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "332e335f-1dce-43ee-a335-65c77c599ad2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "\n",
+    "weekday_names = {\n",
+    "    0: \"Monday\",\n",
+    "    1: \"Tuesday\",\n",
+    "    2: \"Wednesday\",\n",
+    "    3: \"Thursday\",\n",
+    "    4: \"Friday\",\n",
+    "    5: \"Saturday\",\n",
+    "    6: \"Sunday\",\n",
+    "}\n",
+    "\n",
+    "df[\"Issue Date\"] = df[\"Issue Date\"].astype(\"datetime64[ms]\")\n",
+    "df[\"issue_weekday\"] = df[\"Issue Date\"].dt.weekday.map(weekday_names)\n",
+    "\n",
+    "df.groupby([\"issue_weekday\"])[\"Summons Number\"].count().sort_values()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d8f167a3-7c94-48ff-972d-d0bd23b37137",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "With `cudf.pandas` you can GPU accelerated workflows that involve data that is in a Snowflake table, by just reading it into a pandas d\n",
+    "\n",
+    "When things start to get a little slow, just load the cudf.pandas and run your existing code on a GPU!\n",
+    "\n",
+    "To learn more, we encourage you to visit [rapids.ai/cudf-pandas](https://rapids.ai/cudf-pandas/)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5599cc6b-d809-4e6d-b67d-7309edde005e",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}