Releases · NVIDIA/cutlass
CUTLASS 3.8.0
CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture.
For a background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.
- Support for new CuTe building blocks specifically for Blackwell SM100 architecture:
- 5th generation Blackwell Tensor Core instructions (TCGen05) via CuTe MMA atoms.
- Extensions to Tensor Memory Accelerator via CuTe Copy atoms.
- Exposure of Blackwell's new tensor memory (note: distinct from TMA) as `tmem` across CuTe as a first-class data locale.
- Exposure of `tmem -> rmem`, `rmem -> tmem`, and `smem -> tmem` data movement instructions as copy atoms in CuTe.
- A `make_tmem_copy()` utility method to ease creation of tiled copies for `tmem` copy atoms.
- Support for new variants of LDSM on Blackwell via CuTe Copy atoms.
- Support for new CUTLASS building blocks specifically for Blackwell SM100 architecture:
- Various narrow precision FP4, FP6, and FP8 formats as well as their block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8
- Pipelines that implement Blackwell specific synchronization.
- Cluster launch control API supporting preferred and fallback cluster shapes.
- Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.
- Tile schedulers using Blackwell's Cluster Launch Control (CLC) feature to implement dynamic persistence scheduling for GEMMs, and stream-K.
- Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
- Full support for Blackwell SM100 kernels in CUTLASS 3.x API:
- Blackwell-specific kernel layers that:
- Implement a new warp-specialization recipe tuned specifically for Blackwell SM100 architecture.
- Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.
- Support stream-K load balancing for all kernel types everywhere via composable scheduler support.
- Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for:
- Non-block scaled data types without support for pointer array and grouped GEMM with TMA
- Non-block scaled data types with support for pointer array and grouped GEMM with TMA
- Block scaled data types without support for pointer array and grouped GEMM with TMA
- Block scaled data types with support for pointer array and grouped GEMM with TMA
- Blackwell collective mainloop for convolution kernels supporting non-block scaled data types for fprop, dgrad, and wgrad.
- New GEMM, convolution, and epilogue dispatch policies for collectives, kernel layers, and builders.
- Blackwell epilogue that supports loading accumulators from `tmem` and a full set of EVT fusions.
- CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
- Support for preferred and fallback cluster shapes via profiler command-line argument parsing to set dynamic cluster shapes.
- Support for dynamic datatypes via profiler command-line argument parsing to set the dynamic datatype in TCGen05 MMA instruction descriptors.
- Support for mixed input GEMM kernels on Hopper in the profiler.
- New CUTLASS profiler flag `use-cuda-graphs` to reduce overheads when benchmarking launch-bound kernels.
- A new 3.x version of grouped GEMM added to the CUTLASS library, generating kernels for Hopper and Blackwell. Grouped GEMM support is now enabled in the CUTLASS profiler (see `./cutlass_profiler --operation=GroupedGemm --help` for details).
- Set of examples that demonstrate the usage of the 3.x API for targeting the Blackwell SM100 architecture:
- Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples, demonstrating ease of migration for off-the-shelf kernels using the 3.x collective builder API (a minimal builder sketch follows this list).
- GEMM with opt-in collective builder schedules showcasing available recipes for Blackwell.
- Block scaled data type GEMMs targeting Blackwell's native block scaled Tensor Cores:
- GEMM example demonstrating Blackwell's new preferred cluster support via dynamic cluster shapes for increased occupancy.
- GEMM with CLC based StreamK scheduler for load balancing.
- Grouped GEMM for vanilla FP8 data inputs and NVFP4 block scaled inputs.
- Convolution kernels for fprop, dgrad, and wgrad.
- Fused multi-head attention fprop kernel supporting fp16/bf16/fp8 data types across head dims of 32, 64, and 128.
- A new BF16x9 GEMM kernel that emulates FP32 GEMM (SGEMM) using BF16 operations.
- Set of examples that demonstrate the usage of the 3.x API for targeting Hopper architecture:
- A set of new Hopper grouped GEMM kernels that support mixed A and B datatypes.
- A new Hopper FP8 GEMM with groupwise scaling.
- Documentation updates:
- Quickstart - instantiating a Blackwell block-scaled GEMM.
- Detailed Blackwell block-scaled GEMM functionality documentation
- A new functionality document specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA toolkit support, etc. for 3.x-supported architectures.
- Updates to compatibility section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and Target Architecture.
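As a rough illustration of the 3.x collective builder API that the basic GEMM examples above build on, the sketch below assembles a Hopper-style FP16 GEMM kernel type. It is a minimal sketch assuming the SM90 builder names used in the bundled examples; the tile shape, cluster shape, and element types are illustrative choices, and per the notes above the Blackwell examples differ mainly in the architecture tag and the schedules the builder selects.

```cpp
// Minimal sketch of the CUTLASS 3.x collective builder flow (SM90-style names,
// as in the bundled examples). Shapes and element types are illustrative only.
#include "cutlass/cutlass.h"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cute/tensor.hpp"

using namespace cute;

using TileShape    = Shape<_128, _128, _64>;   // CTA tile (M, N, K)
using ClusterShape = Shape<_1, _1, _1>;        // thread block cluster shape

// Epilogue collective: D = alpha * acc + beta * C, with C/D in FP16.
using CollectiveEpilogue = cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,                                       // accumulator, compute
    cutlass::half_t, cutlass::layout::RowMajor, 8,      // C
    cutlass::half_t, cutlass::layout::RowMajor, 8,      // D
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;

// Mainloop collective: FP16 A/B, FP32 accumulation, auto schedule/stages.
using CollectiveMainloop = cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor, 8,      // A
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,   // B
    float,                                              // accumulator
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;

// Compose the kernel layer and wrap it in a device-level adapter.
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,                          // problem shape (M, N, K, L)
    CollectiveMainloop, CollectiveEpilogue>;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```

The resulting `Gemm` type is then instantiated with a `Gemm::Arguments` struct and driven through the usual `can_implement` / `initialize` / `run` flow.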
Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits. The CUTLASS team is working on a fix.
CUTLASS 3.7.0
- A new Hopper blockwise scaling FP8 GEMM where the operands and block scaling tensor are staged via shared memory.
- Distributed GEMM is an experimental pipelined Tensor Parallelism implementation utilizing existing CUTLASS kernels and CUDA runtime features, which can hide most of the communication behind computation.
- Improved persistent grid launch for Hopper kernels with large cluster sizes (size >= 4) using the new `make_kernel_hardware_info` API, as shown in example 48 (see the sketch after this list).
- Enabled high-precision accumulation for Hopper FP8 Sparse GEMM.
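Since the item above references the new `make_kernel_hardware_info` API only by name, the following is a hedged sketch of the underlying `cutlass::KernelHardwareInfo` setup that persistent kernels consume; the header path and the `Arguments` usage in the trailing comment are assumptions based on the bundled examples, not the exact example 48 code.

```cpp
#include <cstdio>
#include "cutlass/kernel_hardware_info.h"   // header name assumed from the bundled examples

int main() {
  cutlass::KernelHardwareInfo hw_info;
  hw_info.device_id = 0;
  // Query the SM count so a persistent kernel can launch one CTA (or cluster) per SM.
  hw_info.sm_count =
      cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
  std::printf("device %d has %d SMs\n", hw_info.device_id, hw_info.sm_count);
  // hw_info is then passed through the kernel's Arguments struct, e.g.
  //   typename Gemm::Arguments args{mode, problem_shape, mainloop_args, epilogue_args, hw_info};
  return 0;
}
```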
CUTLASS 3.6.0
- Hopper structured sparse GEMM.
- A refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal`. The 3.x convolution API is no longer considered a beta API.
- An improved mixed-input GEMM and a lookup table implementation for the `INT4` x `FP8` scale-only mode.
- EVT nodes for Top-K selection and softmax, and a GEMM example using them.
- Programmatic Dependent Launch (PDL) that leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding documentation.
- A new debugging tool, synclog, for dumping out all synchronization events from within a kernel to a file. Please see synclog documentation for details.
- A new TMA-enabled epilogue for grouped GEMM that brings significant performance improvement, as well as its EVT support.
- A SIMT-enabled pointer-array epilogue.
- A new Ping-Pong kernel schedule for Grouped GEMM and some other optimizations.
- A new instantiation strategy for CUTLASS profiler kernels along with improved documentation for instantiation level in CUTLASS profiler.
- New hardware support for comparisons and computations of `cutlass::bfloat16_t`.
- Fixed use of `isnan` on Windows for `half_t` (see the sketch after this list).
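A small host-side sketch of the numeric-type items above; it assumes only the long-standing `cutlass::half_t` / `cutlass::bfloat16_t` types from `cutlass/numeric_types.h` and exercises the comparison and `isnan` paths mentioned in the notes.

```cpp
#include <iostream>
#include "cutlass/numeric_types.h"

int main() {
  cutlass::bfloat16_t a(2.0f), b(3.0f);
  cutlass::half_t     h(0.0f);

  bool less     = (a < b);                 // bfloat16_t comparison operators
  auto sum      = a + b;                   // bfloat16_t arithmetic
  bool h_is_nan = cutlass::isnan(h / h);   // 0/0 -> NaN; isnan overload for half_t

  std::cout << less << " " << float(sum) << " " << h_is_nan << "\n";
  return 0;
}
```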
CUTLASS 3.5.1
- Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code.
- Exposure of L2 `cache_hint`s in TMA copy atoms.
- Exposure of raster order and tile swizzle extent in the CUTLASS library profiler and example 48.
- TMA store based and EVT supported epilogues for Hopper pointer array batched kernels.
- A new `GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels to enable serial and parallel split-k for sparse tensor cores, and new tiny tile sizes to better support LLM inference.
- CUDA host adapter extensions to support TMA descriptor construction driver APIs.
- Inclusion of more Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler.
- Support for residual add (beta != 0) in convolution kernels.
- A new convolution epilogue for CUTLASS 2.x to support non-packed NHWC output.
- A refactor of include files throughout CUTLASS core directories to reduce circular dependencies and tests to guard against them.
- A guide for setting up VSCode to work well with CUTLASS, and an expanded code style guide.
- Better support for MSVC as a host compiler.
- Many performance optimizations, improvements, and bug fixes including fixes for FlashAttention-2.
- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
- NOTICE:
- The upcoming CUTLASS 3.6 release will include a breaking refactor to the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal`. After this, the 3.x convolution API will no longer be considered a beta API.
- The upcoming CUTLASS 3.6 release will include a breaking refactor to the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.
CUTLASS 3.5.0
- Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
- Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
- Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
- Support for Fprop, Dgrad, and Wgrad algorithms.
- CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
- NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until 3.7 release. Your feedback is welcome on the design!
- Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
- Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
- Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe, and the general strategy for implementing convolutions as specializations of GETTs.
- Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores.
- 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
- Updates to CuTe documentation for `cute::Tensor<>`, MMA atoms, and an overhauled CuTe GEMM tutorial series (see the sketch after this list).
- Extensions to CuTe to support L2 prefetching and TMA store+reductions.
- Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17.
- Fixes to greatly reduce build warnings.
- Updates and bugfixes from the community (thanks!)
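For readers following the updated CuTe documentation mentioned above, the sketch below shows the two concepts it centers on, layouts and tensors; the shapes and values are arbitrary illustrations, not anything specific to the tutorial series.

```cpp
#include <vector>
#include <cute/tensor.hpp>

int main() {
  using namespace cute;

  std::vector<float> buf(8 * 16, 1.0f);

  // An 8x16 row-major layout: shape (8,16), stride (16,1).
  auto layout = make_layout(make_shape(Int<8>{}, Int<16>{}), LayoutRight{});

  // A non-owning tensor view pairing the buffer pointer with the layout.
  auto t = make_tensor(buf.data(), layout);

  t(2, 3) = 42.0f;     // coordinate access goes through the layout
  print(layout);       // prints the (shape):(stride) of the layout
  print_tensor(t);     // prints the tensor contents

  return 0;
}
```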
CUTLASS 3.4.1
- Statically available CUTLASS Version macros that allow for handling API changes between CUTLASS releases on the users' side.
- Improvements for Hopper Group-GEMMs and Pointer-Array Batched GEMMs.
- Updates and bugfixes from the community (thanks!).
CUTLASS 3.4.0
- Improved Mixed-input Hopper GEMMs supporting {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors tuned for optimal performance on Hopper H100.
- Beta release of Pointer-Array Batched GEMMs utilizing TMA and Hopper H100 tensor cores now available. (Requires CUDA 12.3 or above)
- Beta release of Group-GEMM, commonly used in the optimization of Mixture-of-Experts models, now available on Hopper GPUs taking advantage of TMA and Hopper H100 tensor cores. (Requires CUDA 12.3 or above)
- Ampere Sparse GEMM supports Epilogue Visitor Tree (EVT) now.
- Improvements to NamedBarriers, including details of `ReservedNamedBarriers` used within the CUTLASS library (see the sketch after this list).
- Improved CuTe documentation including improved clarity and depth of Quickstart, CuTe Layout, and CuTe Layout Algebra. Associated code comments, post-conditions, and details in CuTe Core Unit Tests also improved.
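The `NamedBarrier` item above is easiest to picture with a small device-side sketch. This is a hedged illustration of the constructor-plus-`arrive_and_wait` pattern from `cutlass/arch/barrier.h`; the barrier id and thread count here are arbitrary choices for illustration, and user code should keep its ids clear of the ones `ReservedNamedBarriers` documents as taken by the library.

```cpp
#include "cutlass/arch/barrier.h"

// Hedged sketch: the first 128 threads of a CTA synchronize on a named barrier,
// so the remaining threads are free to do unrelated work.
__global__ void producer_consumer(float* out) {
  __shared__ float staged;

  // 128 participating threads, user-chosen barrier id 0 (illustrative assumption).
  cutlass::arch::NamedBarrier bar(/*num_threads=*/128, /*id=*/0);

  if (threadIdx.x < 128) {
    if (threadIdx.x == 0) {
      staged = 3.14f;             // producer writes shared data
    }
    bar.arrive_and_wait();        // all 128 participants synchronize
    out[threadIdx.x] = staged;    // consumers read after the barrier
  }
}
```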
CUTLASS 3.3.0
- New Mixed-input Hopper GEMM support covering 16-bit x 8-bit input types with optimal performance.
- New Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operand B {fp16, bf16} x {s8, u8} and upcast on operand A {s8, u8} x {fp16, bf16}. They also include fast numeric conversion recipes and warp-level shuffles to achieve optimal performance.
- New Copy Async based Hopper GEMMs, which support lower-than-16B-aligned input tensors (across s8/fp8/fp16/bf16/tf32 types) with optimal performance. As part of this, new kernel schedules and Copy Ops `SM80_CP_ASYNC_CACHE_*` were also added.
- EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
- Various subbyte enhancements like tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
- Support for Clang as a host compiler.
- Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface.
CUTLASS 3.2.2
Bug fix for illegal memory access issue hit by Flash Attention tests in PyTorch. See #1138 for details.
CUTLASS 3.2.1
- Python support for SM90 Epilogue Visitor Tree (EVT) on top of the C++ support released in 3.2.0.
- SM80 EVT support in C++ and Python.
- Other SM90 epilogue improvements.
- Splitting the CUTLASS library into smaller units based on operation, arch, and datatypes. See #1105 for details.
- Making tools/library/scripts packageable - tools/library/scripts is now moving to python/cutlass_library. See the Python README for details.
- SM90 TF32 kernel improvements for all layouts.
- SM90 rasterization direction support in the CUTLASS profiler.
- Improvement for CUTLASS profiler build times.
- Remove Python-C++ bindings.