Releases · NVIDIA/cutlass
CUTLASS 3.8.0
CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture.
For a background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.
- Support for new CuTe building blocks specifically for Blackwell SM100 architecture:
- 5th generation Blackwell Tensor Core instructions (TCGen05) via CuTe MMA atoms.
- Extensions to Tensor Memory Accelerator via CuTe Copy atoms.
- Exposure of Blackwell's new tensor memory (note: distinct from TMA) as `tmem` across CuTe as a first-class data locale.
- Exposure of `tmem -> rmem`, `rmem -> tmem`, and `smem -> tmem` data movement instructions as copy atoms in CuTe.
- A `make_tmem_copy()` utility method to ease creation of tiled copies for `tmem` copy atoms.
- Support for new variants of LDSM on Blackwell via CuTe Copy atoms.
- Support for new CUTLASS building blocks specifically for Blackwell SM100 architecture:
- Various narrow precision FP4, FP6, and FP8 formats as well as their block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8
- Pipelines that implement Blackwell specific synchronization.
- Cluster launch control API supporting preferred and fallback cluster shapes.
- Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.
- Tile schedulers using Blackwell's Cluster Launch Control (CLC) feature to implement dynamic persistence scheduling for GEMMs, and stream-K.
- Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
- Full support for Blackwell SM100 kernels in CUTLASS 3.x API:
- Blackwell-specific kernel layers that:
- Implement a new warp-specialization recipe tuned specifically for Blackwell SM100 architecture.
- Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.
- Support stream-K load balancing for all kernel types everywhere via composable scheduler support.
- Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for:
- Non-block scaled data types without support for pointer array and grouped GEMM with TMA
- Non-block scaled data types with support for pointer array and grouped GEMM with TMA
- Block scaled data types without support for pointer array and grouped GEMM with TMA
- Block scaled data types with support for pointer array and grouped GEMM with TMA
- Blackwell collective mainloop for convolution kernels supporting non-block scaled data types for fprop, dgrad, and wgrad.
- New GEMM, convolution, and epilogue dispatch policies for collectives, kernel layers, and builders.
- Blackwell epilogue that supports loading accumulators from `tmem` and a full set of EVT fusions.
- CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
- Support for preferred and fallback cluster shapes via profiler command-line argument parsing to set dynamic cluster shapes.
- Support for dynamic datatypes via profiler command-line argument parsing to set the dynamic datatype in TCGen05 MMA instruction descriptors.
- Support for mixed input GEMM kernels on Hopper in the profiler.
- New CUTLASS profiler flag `use-cuda-graphs` to reduce overheads when benchmarking launch-bound kernels.
- A new 3.x version of grouped GEMM added to the CUTLASS library, generating kernels for Hopper and Blackwell. Grouped GEMM support is now enabled in the CUTLASS profiler (see `./cutlass_profiler --operation=GroupedGemm --help` for details).
- Set of examples that demonstrate the usage of the 3.x API for targeting the Blackwell SM100 architecture:
- Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples, demonstrating ease of migration for off-the-shelf kernels using the 3.x collective builder API (a minimal builder sketch follows this list).
- GEMM with opt-in collective builder schedules showcasing available recipes for Blackwell.
- Block scaled data type GEMMs targeting Blackwell's native block scaled Tensor Cores:
- GEMM example demonstrating Blackwell's new preferred cluster support via dynamic cluster shapes for increased occupancy.
- GEMM with CLC based StreamK scheduler for load balancing.
- Grouped GEMM for vanilla FP8 data inputs and NVFP4 block scaled inputs.
- Convolution kernels for fprop, dgrad, and wgrad.
- Fused multi-head attention fprop kernel supporting fp16/bf16/fp8 data types across head dims of 32, 64, and 128.
- A new BF16x9 GEMM kernel that emulates FP32 GEMM (SGEMM) using BF16 operations.
- Set of examples that demonstrate the usage of the 3.x API for targeting Hopper architecture:
- A set of new Hopper grouped GEMM kernels that support mixed A and B datatypes.
- A new Hopper FP8 GEMM with groupwise scaling.
- Documentation updates:
- Quickstart - instantiating a Blackwell block-scaled GEMM.
- Detailed Blackwell block-scaled GEMM functionality documentation
- A new functionality document specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA toolkit support, etc. for 3.x-supported architectures.
- Updates to compatibility section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and Target Architecture.
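As a rough illustration of the 3.x collective builder API that the basic GEMM examples above build on, the sketch below assembles a Hopper-style FP16 GEMM kernel type. It is a minimal sketch assuming the SM90 builder names used in the bundled examples; the tile shape, cluster shape, and element types are illustrative choices, and per the notes above the Blackwell examples differ mainly in the architecture tag and the schedules the builder selects.

```cpp
// Minimal sketch of the CUTLASS 3.x collective builder flow (SM90-style names,
// as in the bundled examples). Shapes and element types are illustrative only.
#include "cutlass/cutlass.h"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cute/tensor.hpp"

using namespace cute;

using TileShape    = Shape<_128, _128, _64>;   // CTA tile (M, N, K)
using ClusterShape = Shape<_1, _1, _1>;        // thread block cluster shape

// Epilogue collective: D = alpha * acc + beta * C, with C/D in FP16.
using CollectiveEpilogue = cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,                                       // accumulator, compute
    cutlass::half_t, cutlass::layout::RowMajor, 8,      // C
    cutlass::half_t, cutlass::layout::RowMajor, 8,      // D
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;

// Mainloop collective: FP16 A/B, FP32 accumulation, auto schedule/stages.
using CollectiveMainloop = cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor, 8,      // A
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,   // B
    float,                                              // accumulator
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;

// Compose the kernel layer and wrap it in a device-level adapter.
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,                          // problem shape (M, N, K, L)
    CollectiveMainloop, CollectiveEpilogue>;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```

The resulting `Gemm` type is then instantiated with a `Gemm::Arguments` struct and driven through the usual `can_implement` / `initialize` / `run` flow.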
Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits. The CUTLASS team is working on a fix.
CUTLASS 3.7.0
- A new Hopper blockwise scaling FP8 GEMM where the operands and block scaling tensor are staged via shared memory.
- Distributed GEMM is an experimental pipelined Tensor Parallelism implementation utilizing existing CUTLASS kernels and CUDA runtime features, which can hide most of the communication behind computation.
- Improved persistent grid launch for Hopper kernels with large cluster sizes (size >= 4) using the new `make_kernel_hardware_info` API, as shown in example 48 (see the sketch after this list).
- Enabled high-precision accumulation for Hopper FP8 Sparse GEMM.
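Since the item above references the new `make_kernel_hardware_info` API only by name, the following is a hedged sketch of the underlying `cutlass::KernelHardwareInfo` setup that persistent kernels consume; the header path and the `Arguments` usage in the trailing comment are assumptions based on the bundled examples, not the exact example 48 code.

```cpp
#include <cstdio>
#include "cutlass/kernel_hardware_info.h"   // header name assumed from the bundled examples

int main() {
  cutlass::KernelHardwareInfo hw_info;
  hw_info.device_id = 0;
  // Query the SM count so a persistent kernel can launch one CTA (or cluster) per SM.
  hw_info.sm_count =
      cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
  std::printf("device %d has %d SMs\n", hw_info.device_id, hw_info.sm_count);
  // hw_info is then passed through the kernel's Arguments struct, e.g.
  //   typename Gemm::Arguments args{mode, problem_shape, mainloop_args, epilogue_args, hw_info};
  return 0;
}
```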
CUTLASS 3.6.0
- Hopper structured sparse GEMM.
- A refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal`. The 3.x convolution API is no longer considered a beta API.
- An improved mixed-input GEMM and a lookup table implementation for the `INT4` x `FP8` scale-only mode.
- EVT nodes for Top-K selection and softmax, and a GEMM example using them.
- Programmatic Dependent Launch (PDL) that leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding documentation.
- A new debugging tool, synclog, for dumping out all synchronization events from within a kernel to a file. Please see synclog documentation for details.
- A new TMA-enabled epilogue for grouped GEMM that brings significant performance improvement, as well as its EVT support.
- A SIMT-enabled pointer-array epilogue.
- A new Ping-Pong kernel schedule for Grouped GEMM and some other optimizations.
- A new instantiation strategy for CUTLASS profiler kernels along with improved documentation for instantiation level in CUTLASS profiler.
- New hardware support for comparisons and computations of `cutlass::bfloat16_t`.
- Fixed use of `isnan` on Windows for `half_t` (see the sketch after this list).
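A small host-side sketch of the numeric-type items above; it assumes only the long-standing `cutlass::half_t` / `cutlass::bfloat16_t` types from `cutlass/numeric_types.h` and exercises the comparison and `isnan` paths mentioned in the notes.

```cpp
#include <iostream>
#include "cutlass/numeric_types.h"

int main() {
  cutlass::bfloat16_t a(2.0f), b(3.0f);
  cutlass::half_t     h(0.0f);

  bool less     = (a < b);                 // bfloat16_t comparison operators
  auto sum      = a + b;                   // bfloat16_t arithmetic
  bool h_is_nan = cutlass::isnan(h / h);   // 0/0 -> NaN; isnan overload for half_t

  std::cout << less << " " << float(sum) << " " << h_is_nan << "\n";
  return 0;
}
```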
CUTLASS 3.5.1
- Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code.
- Exposure of L2 `cache_hint`s in TMA copy atoms.
- Exposure of raster order and tile swizzle extent in the CUTLASS library profiler and example 48.
- TMA store based and EVT supported epilogues for Hopper pointer array batched kernels.
- A new `GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels to enable serial and parallel split-k for sparse tensor cores, and new tiny tile sizes to better support LLM inference.
- CUDA host adapter extensions to support TMA descriptor construction driver APIs.
- Inclusion of more Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler.
- Support for residual add (beta != 0) in convolution kernels.
- A new convolution epilogue for CUTLASS 2.x to support non-packed NHWC output.
- A refactor of include files throughout CUTLASS core directories to reduce circular dependencies and tests to guard against them.
- A guide for setting up VSCode to work well with CUTLASS, and an expanded code style guide.
- Better support for MSVC as a host compiler.
- Many performance optimizations, improvements, and bug fixes including fixes for FlashAttention-2.
- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
- NOTICE:
- The upcoming CUTLASS 3.6 release will include a breaking refactor to the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal`. After this, the 3.x convolution API will no longer be considered a beta API.
- The upcoming CUTLASS 3.6 release will include a breaking refactor to the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.
CUTLASS 3.5.0
- Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
- Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
- Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
- Support for Fprop, Dgrad, and Wgrad algorithms.
- CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
- NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until 3.7 release. Your feedback is welcome on the design!
- Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
- Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
- Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe, and the general strategy for implementing convolutions as specializations of GETTs.
- Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores.
- 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
- Updates to CuTe documentation for `cute::Tensor<>`, MMA atoms, and an overhauled CuTe GEMM tutorial series (see the sketch after this list).
- Extensions to CuTe to support L2 prefetching and TMA store+reductions.
- Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17.
- Fixes to greatly reduce build warnings.
- Updates and bugfixes from the community (thanks!)
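For readers following the updated CuTe documentation mentioned above, the sketch below shows the two concepts it centers on, layouts and tensors; the shapes and values are arbitrary illustrations, not anything specific to the tutorial series.

```cpp
#include <vector>
#include <cute/tensor.hpp>

int main() {
  using namespace cute;

  std::vector<float> buf(8 * 16, 1.0f);

  // An 8x16 row-major layout: shape (8,16), stride (16,1).
  auto layout = make_layout(make_shape(Int<8>{}, Int<16>{}), LayoutRight{});

  // A non-owning tensor view pairing the buffer pointer with the layout.
  auto t = make_tensor(buf.data(), layout);

  t(2, 3) = 42.0f;     // coordinate access goes through the layout
  print(layout);       // prints the (shape):(stride) of the layout
  print_tensor(t);     // prints the tensor contents

  return 0;
}
```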
CUTLASS 3.4.1
- Statically available CUTLASS Version macros that allow for handling API changes between CUTLASS releases on the users' side.
- Improvements for Hopper Group-GEMMs and Pointer-Array Batched GEMMs.
- Updates and bugfixes from the community (thanks!).
CUTLASS 3.4.0
- Improved Mixed-input Hopper GEMMs supporting {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors tuned for optimal performance on Hopper H100.
- Beta release of Pointer-Array Batched GEMMs utilizing TMA and Hopper H100 tensor cores now available. (Requires CUDA 12.3 or above)
- Beta release of Group-GEMM, commonly used in the optimization of Mixture-of-Experts models, now available on Hopper GPUs taking advantage of TMA and Hopper H100 tensor cores. (Requires CUDA 12.3 or above)
- Ampere Sparse GEMM supports Epilogue Visitor Tree (EVT) now.
- Improvements to NamedBarriers, including details of `ReservedNamedBarriers` used within the CUTLASS library (see the sketch after this list).
- Improved CuTe documentation including improved clarity and depth of Quickstart, CuTe Layout, and CuTe Layout Algebra. Associated code comments, post-conditions, and details in CuTe Core Unit Tests also improved.
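The `NamedBarrier` item above is easiest to picture with a small device-side sketch. This is a hedged illustration of the constructor-plus-`arrive_and_wait` pattern from `cutlass/arch/barrier.h`; the barrier id and thread count here are arbitrary choices for illustration, and user code should keep its ids clear of the ones `ReservedNamedBarriers` documents as taken by the library.

```cpp
#include "cutlass/arch/barrier.h"

// Hedged sketch: the first 128 threads of a CTA synchronize on a named barrier,
// so the remaining threads are free to do unrelated work.
__global__ void producer_consumer(float* out) {
  __shared__ float staged;

  // 128 participating threads, user-chosen barrier id 0 (illustrative assumption).
  cutlass::arch::NamedBarrier bar(/*num_threads=*/128, /*id=*/0);

  if (threadIdx.x < 128) {
    if (threadIdx.x == 0) {
      staged = 3.14f;             // producer writes shared data
    }
    bar.arrive_and_wait();        // all 128 participants synchronize
    out[threadIdx.x] = staged;    // consumers read after the barrier
  }
}
```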
CUTLASS 3.3.0
- New Mixed-input Hopper GEMM support covering 16-bit x 8-bit input types with optimal performance.
- New Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operand B {fp16, bf16} x {s8, u8} and upcast on operand A {s8, u8} x {fp16, bf16}. They also include fast numeric conversion recipes and warp-level shuffles to achieve optimal performance.
- New Copy Async based Hopper GEMMs, which support lower-than-16B-aligned input tensors (across s8/fp8/fp16/bf16/tf32 types) with optimal performance. As part of this, new kernel schedules and Copy Ops `SM80_CP_ASYNC_CACHE_*` were also added.
- EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
- Various subbyte enhancements like tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
- Support for Clang as a host compiler.
- Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface.
CUTLASS 3.2.2
Bug fix for illegal memory access issue hit by Flash Attention tests in PyTorch. See #1138 for details.
CUTLASS 3.2.1
- Python support for SM90 Epilogue Visitor Tree (EVT) on top of the C++ support released in 3.2.0.
- SM80 EVT support in C++ and Python.
- Other SM90 epilogue improvements.
- Splitting the CUTLASS library into smaller units based on operation, arch, and datatypes. See #1105 for details.
- Making tools/library/scripts packageable - tools/library/scripts is now moving to python/cutlass_library. See the Python README for details.
- SM90 TF32 kernel improvements for all layouts.
- SM90 rasterization direction support in the CUTLASS profiler.
- Improvement for CUTLASS profiler build times.
- Remove Python-C++ bindings.