Slides, notes, and materials for the workshop
- RSVP link: https://lu.ma/1wu5ppl5
- Host: @chiphuyen's Discord community
- YouTube recording
- The talks are pretty technical, given that this is a workshop on GPU optimization. The speakers try their best to make their topics accessible, but you’ll get more out of the workshop if you familiarize yourself with the basic concepts in advance. (See Reading materials)
- The event will be livestreamed on YouTube, but questions should be asked on Discord, not YouTube.
- With 2,000+ people signed up for the event, we expect a lot of interesting live discussions on Discord.
- Workshop TAs who will be helping us run the workshop:
[12:00] Crash course on GPU optimization (Mark Saroufim @ Meta)
Mark is a PyTorch core developer and cofounder of CUDA MODE. He also ran the really fun NeurIPS LLM Efficiency challenge last year. Previously, he was at Graphcore and Microsoft.
Mark will give an overview of why GPUs, the metrics that matter, and different GPU programming models (thread-based CUDA and block-based Triton). He promises this will be a painless guide to writing CUDA/Triton kernels! This talk will give us the basics to understand the rest of the workshop.
[12:45] High-performance LLM serving on GPUs (Sharan Chetlur @ NVIDIA)
Sharan is a principal engineer working on TensorRT-LLM at NVIDIA. He’s been working on CUDA since 2012, optimizing the performance of deep learning models from a single GPU to full data center scale. Previously, he was the Director of Engineering at Cerebras.
Sharan will discuss how to build performant, flexible solutions to optimize LLM serving given the rapid evolution of new models and techniques. The talk will cover optimization techniques such as token concatenation, different strategies for batching, and caching.
[13:20] Block-based GPU Programming with Triton (Philippe Tillet @ OpenAI)
Philippe is currently leading the Triton team at OpenAI. Previously, he was at pretty much all major chip makers including NVIDIA, AMD, Intel, and Nervana.
Philippe will explain how Triton works and how its block-based programming model differs from the traditional single instruction, multiple threads (SIMT) programming model that CUDA follows. Triton aims to be higher-level than CUDA while being more expressive (lower-level) than common graph compilers like XLA and TorchInductor.
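To make the contrast concrete, here is a minimal pure-Python sketch of the two models for a vector add. It is an emulation only, not real CUDA or Triton code: the function names `add_thread_based` and `add_block_based` are hypothetical, and the loops that run sequentially here would run in parallel on a GPU.

```python
# Pure-Python emulation of the two programming models; no GPU is required.
# All function names are hypothetical and exist only for illustration.

def add_thread_based(x, y, threads_per_block=4):
    """Thread-based (CUDA-style): each thread computes ONE scalar element."""
    n = len(x)
    out = [0.0] * n
    num_blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(num_blocks):              # parallel on a GPU
        for thread_idx in range(threads_per_block):  # parallel on a GPU
            i = block_idx * threads_per_block + thread_idx
            if i < n:  # each thread checks its own bounds
                out[i] = x[i] + y[i]
    return out


def add_block_based(x, y, block_size=4):
    """Block-based (Triton-style): each program handles a WHOLE block."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size
    for pid in range(num_programs):                  # parallel on a GPU
        # The program reasons about a block of offsets at once.
        offsets = [pid * block_size + k for k in range(block_size)]
        mask = [o < n for o in offsets]  # one mask guards the whole block
        x_blk = [x[o] if m else 0.0 for o, m in zip(offsets, mask)]
        y_blk = [y[o] if m else 0.0 for o, m in zip(offsets, mask)]
        out_blk = [a + b for a, b in zip(x_blk, y_blk)]  # block-wide add
        for o, m, v in zip(offsets, mask, out_blk):
            if m:  # masked store: skip out-of-range offsets
                out[o] = v
    return out
```

Both functions return the same result. The difference is the unit of work the programmer writes code for: a single element per thread in the first, a masked block of elements per program in the second.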
[14:00] Scaling data processing from CPU to distributed GPUs (William Malpica @ Voltron Data)
William is a co-founder of Voltron Data and the creator of BlazingSQL. He helped scale Theseus, a GPU-native query engine, to handle 100TB queries!
Most people today use GPUs for training and inference. Data processing is a category of workloads that GPUs excel at but are underused for. In this talk, William will discuss why large-scale data processing should be done on GPUs instead of CPUs and how different tools like cuDF, RAPIDS, and Theseus leverage GPUs for data processing.
Please read the schedule carefully. If there are terms you’re not familiar with, you might want to look them up in advance. Examples:
- Memory bound vs. compute bound: whether the bottleneck is how fast data moves to and from the GPU’s memory (memory bandwidth) or how fast the GPU can do the arithmetic (compute throughput).
- Thread-based vs. block-based: different programming models for GPU programming. CUDA is thread-based and Triton is block-based.
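As a rough illustration of the memory bound vs. compute bound distinction, the sketch below compares a kernel's arithmetic intensity (FLOPs per byte moved) with a GPU's ratio of peak compute to peak memory bandwidth. The hardware numbers are made up for the example and do not describe any real GPU. The helper `classify` is hypothetical, and the byte counts assume each array is read or written exactly once.

```python
# Back-of-the-envelope roofline check; all hardware numbers are hypothetical.

PEAK_FLOPS_PER_S = 100e12  # assumed peak compute: 100 TFLOP/s
PEAK_BYTES_PER_S = 2e12    # assumed peak memory bandwidth: 2 TB/s


def classify(flops, bytes_moved,
             peak_flops=PEAK_FLOPS_PER_S, peak_bw=PEAK_BYTES_PER_S):
    """Return 'memory bound' or 'compute bound' for an idealized kernel."""
    intensity = flops / bytes_moved  # FLOPs performed per byte moved
    ridge = peak_flops / peak_bw     # intensity where both limits meet
    return "compute bound" if intensity >= ridge else "memory bound"


# Elementwise add of n float32 values: n FLOPs,
# 12 bytes per element (read two 4-byte inputs, write one 4-byte output).
n = 1_000_000
vector_add_kind = classify(flops=n, bytes_moved=12 * n)

# Square float32 matmul of size N: 2*N^3 FLOPs,
# 12*N^2 bytes if each of the three matrices is touched once (idealized).
N = 4096
matmul_kind = classify(flops=2 * N**3, bytes_moved=12 * N**2)

print(vector_add_kind)  # memory bound
print(matmul_kind)      # compute bound
```

Under these assumptions the elementwise add does about 0.08 FLOPs per byte, far below the ridge point of 50, so it is limited by memory bandwidth. The large matmul does roughly 680 FLOPs per byte, so it is limited by compute.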
Tools that will be discussed in the workshop:
- Development repository for the Triton language and compiler
- TensorRT and TensorRT-LLM
- Check out Mark’s lecture on profiling CUDA in PyTorch.
- Model Inference Optimization Checklist
- Accelerating Generative AI with PyTorch: Segment Anything, Fast
- rapidsai/cudf - GPU DataFrame Library
- Benchmarking Report: Theseus Engine | Voltron Data
Recommended resources:
- How CUDA Programming Works - Stephen Jones, NVIDIA (great lecture)
- The Best GPUs for Deep Learning in 2023 — An In-depth Analysis (Tim Dettmers)
- CUDA MODE Discord. They have a great lecture series on GPU optimization.