Querying Big Data using Taipy and Dask

[Screenshot: the web app]

This project uses Taipy to build a Dask pipeline that runs queries on a 24 GB dataset of Yelp reviews, and a web app to launch those queries and display the results.

Taipy is a great way to manage and display the results of Dask applications: its back end is built for large-scale workloads and handles caching and parallelization.

Table of Contents

  • Why Taipy?
  • Why Dask?
  • Datasets
  • Data Pipeline
  • Web App
  • How to Run

Why Taipy?

Taipy is an open-source Python library that covers both the front end and the back end:

  • Taipy GUI helps create web apps quickly using only Python code (see the sketch below)
  • Taipy Core helps manage data pipelines by visualizing them and caching already-computed results
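
For a sense of the GUI side, here is a minimal sketch of a one-page Taipy app; the page content is made up for this example and is not taken from this repository:

```python
from taipy.gui import Gui

# Pages are described in a Markdown-like syntax; this one-liner
# renders a single heading and starts a local web server.
Gui(page="# Hello, *Taipy*!").run()
```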

Why Dask?

Pandas is an excellent library for data analysis, but it is not designed to handle large datasets. Dask extends the Pandas API to parallelize computations and handle larger-than-memory datasets.
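
As an illustration, here is a hedged sketch of how Dask mirrors the Pandas API while staying lazy; the file path and the stars column follow the dataset description below:

```python
import dask.dataframe as dd

# Nothing is loaded into memory yet: Dask builds a lazy,
# partitioned task graph over the large file.
reviews = dd.read_csv("data/yelp.csv")

# .compute() triggers the actual parallel computation.
average_stars = reviews["stars"].mean().compute()
print(average_stars)
```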

Datasets

The datasets used in this project are based on Yelp Reviews datasets:

  • data/yelp_business.csv contains information about businesses (mainly name and id)
  • data/yelp.csv is a 24 GB dataset containing Yelp reviews (mainly business id, text, and stars)

The goal is to join these two datasets and run queries that return the average stars and the reviews for a specific business.
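
Outside of Taipy, the core query could look like this hedged sketch; the column names (name, business_id, text, stars) are assumptions based on the descriptions above:

```python
import pandas as pd
import dask.dataframe as dd

# The small business table fits in memory; the 24 GB review file does not.
business = pd.read_csv("data/yelp_business.csv")
reviews = dd.read_csv("data/yelp.csv")

# Look up the id of the business we want to query (column names assumed).
business_id = business.loc[business["name"] == "Some Business", "business_id"].iloc[0]

# Filter the big dataset, keep the relevant columns, then materialize.
matching = reviews[reviews["business_id"] == business_id][["text", "stars"]].compute()
print(matching["stars"].mean(), len(matching))
```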

Data Pipeline

The Data Pipeline is built using Taipy Studio in VS Code and looks like this:

[Diagram: the complete pipeline in Taipy Studio]

Blue nodes are Data Nodes that store variables or datasets:

[Diagram: input Data Nodes]

  • business_data is the yelp_business.csv dataset as a Pandas DataFrame
  • business_dict is a dictionary mapping business ids to business names
  • business_name is the name of the business we want to query
  • business_id is the id of the business we want to query
  • review_data is the yelp.csv dataset as a Dask DataFrame

Generic Data Node

review_data is a generic data node that calls a Python function read_fct to read the dataset, with a read_fct_params argument specifying the file path.
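
A hedged sketch of how this configuration might look in Python; the node id and path follow the descriptions above, but this is not copied from the repository:

```python
import dask.dataframe as dd
from taipy import Config

def read_fct(path: str):
    # Lazily open the review file as a Dask DataFrame.
    return dd.read_csv(path)

review_data_cfg = Config.configure_generic_data_node(
    id="review_data",
    read_fct=read_fct,
    read_fct_params=("data/yelp.csv",),
)
```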

[Diagram: output Data Nodes]

  • raw_reviews holds the reviews returned by the query
  • parsed_reviews is raw_reviews filtered to contain only the relevant columns

Between the blue nodes are orange Task Nodes, which take blue nodes as inputs and return blue nodes as outputs using Python functions.
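
As a hedged sketch of how such a Task might be declared in Taipy Core, assuming illustrative config names and the column names used earlier:

```python
from taipy import Config

# Data node configs for the task's input and output
# (default storage; ids follow the node names above).
raw_reviews_cfg = Config.configure_data_node(id="raw_reviews")
parsed_reviews_cfg = Config.configure_data_node(id="parsed_reviews")

def parse_reviews(raw_reviews):
    # Keep only the columns the app needs (column names assumed).
    return raw_reviews[["text", "stars"]]

parse_reviews_task_cfg = Config.configure_task(
    id="parse_reviews",
    function=parse_reviews,
    input=raw_reviews_cfg,
    output=parsed_reviews_cfg,
    skippable=True,  # enables the caching behavior described below
)
```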

[Diagram: Task Nodes connecting the Data Nodes]

These Task Nodes are called by a green node, the **Pipeline Node**, which is the entry point of the pipeline.

Task Nodes have a skippable property that allows them to be skipped if their output has already been computed and cached.

For example, if we have already run a first query and then run a second one, Taipy will log `[2023-05-16 04:40:06,858][Taipy][INFO] job JOB_read_review_39d2bb45-8901-4081-b877-2e308507bb90 is skipped.`, meaning it did not reread the dataset but used the cached result instead.

This pipeline can find reviews for a specific business from the datasets.
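
Putting it together, here is a hedged sketch of submitting the pipeline from Python, assuming the Taipy 2.x pipeline API and an illustrative config file name and pipeline id:

```python
import taipy as tp
from taipy import Config

# Load the pipeline configuration produced by Taipy Studio
# (the file name and the "pipeline" id are assumptions).
Config.load("config.toml")
pipeline_cfg = Config.pipelines["pipeline"]

pipeline = tp.create_pipeline(pipeline_cfg)
pipeline.business_name.write("Some Business")  # the query input
tp.submit(pipeline)

# Read the parsed output once the jobs complete.
print(pipeline.parsed_reviews.read())
```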

Web App

The web app is built using Taipy GUI and looks like this:

[Screenshot: the business selection dropdown]

The app allows you to select a business from a dropdown menu. This calls the pipeline, runs the query (around 5 minutes on the full dataset), and displays the results: the average stars and the reviews for the selected business.
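
A hedged sketch of how such a dropdown could be declared with Taipy GUI; the variable names, the business list, and the callback are made up for this example:

```python
from taipy.gui import Gui

business_names = ["Business A", "Business B"]  # would come from business_dict
selected_business = business_names[0]

def on_selection(state, var_name, value):
    # This is where the pipeline submission sketched above would go.
    print(f"Querying reviews for {value}")

# A selector rendered as a dropdown, bound to selected_business.
page = """
# Yelp Review Explorer
<|{selected_business}|selector|lov={business_names}|dropdown|on_change=on_selection|>
"""

Gui(page=page).run()
```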

How to Run

You can run the app using this repo, which contains a smaller version of the dataset (30 MB):

  1. Clone the repository:

     git clone https://github.com/AlexandreSajus/Taipy-Dask-Demo.git

  2. Install the requirements:

     pip install -r requirements.txt

  3. Run the web app:

     python app.py
