Commit a895699

Add more docs for int4_weight_only API that targets tinygemm (#469)
Summary: as the title, per request in #415 (comment)
Test Plan: doc changes
1 parent 739952b commit a895699

3 files changed, +14 -0 lines changed

Diff for: torchao/quantization/README.md

+5
@@ -62,9 +62,14 @@ Affine quantization refers to the type of quantization that maps from floating p
 ### Quantization Primitives
 We used to have different quantize and dequantize operators for quantization with different granularities. In the end these can all be expressed with a `block_size` argument with different settings, so we unified the existing quant primitives into `choose_qparams_affine`, `quantize_affine` and `dequantize_affine`, which can represent symmetric/asymmetric per tensor/channel/token/channel_group quantization. These can be used to implement the unified quantized tensor subclass.
 
+Note: these primitive ops support two "types" of quantization, distinguished by whether `zero_point` is in the floating point domain or the integer domain. See the docstrings for `choose_qparams` for more details.
+
 ### Quantized Tensor Subclass
 We also have a unified quantized tensor subclass that implements how to get a quantized tensor from a floating point tensor and what it means to call linear ops on an instance of the tensor, e.g. `F.linear` and `aten.addmm`. With this we can dispatch to different operators (e.g. the `int4mm` op) based on device (cpu, cuda), quantization settings (`int4`, `int8`) and packing formats (e.g. a format optimized for the cpu int4 mm kernel).
 
+#### Layouts
+We extended the `layout` concept to represent different packing formats for a tensor. `AffineQuantizedTensor` supports the `plain` and `tensor_core_tiled` layouts. The `plain` layout is used for `int8_weight_only` and `int8_dynamic_activation_int8_weight`, and is also the default layout. The `tensor_core_tiled` layout is used for `int4_weight_only` quantization and packs the weights in a format that is compatible with tinygemm [int4mm](https://github.com/pytorch/pytorch/blob/39357ba06f48cda7d293a4995aa5eba2a46598b5/aten/src/ATen/native/native_functions.yaml#L4138) kernels.
+
 ### Quantization Flow Example
 Let's use int4 weight only quantization that's targeting tinygemm int4 weight only quantized matmul
 as an example:
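
Note on the README hunk above: the claim that every granularity can be expressed through `block_size` may be easier to see with a small sketch. It assumes that `block_size` has one entry per tensor dimension and gives the extent of the block of elements that share one set of quantization parameters. The helper `qparams_shape` is hypothetical and is not part of torchao; it only counts how many blocks (and so how many scale/zero_point entries) a given setting would produce.

```python
def qparams_shape(input_shape, block_size):
    """Hypothetical helper: number of blocks along each dimension.

    Assumes each block of `block_size` elements shares one scale/zero_point,
    so the result is the shape the quantization parameters would have.
    """
    assert len(input_shape) == len(block_size), "one block extent per dimension"
    for dim, blk in zip(input_shape, block_size):
        assert dim % blk == 0, "block extent must divide the dimension"
    return tuple(dim // blk for dim, blk in zip(input_shape, block_size))


weight_shape = (4096, 1024)  # illustrative (out_features, in_features)

# Per tensor: a single block covers the whole weight.
per_tensor = qparams_shape(weight_shape, (4096, 1024))
# Per channel: one block per output row.
per_channel = qparams_shape(weight_shape, (1, 1024))
# Per group: each row is split into groups of 128 input features.
per_group = qparams_shape(weight_shape, (1, 128))
```

Under these assumptions the three settings give 1, 4096 and 4096 x 8 parameter sets respectively, which is why a single `block_size` argument is enough to cover the older per-granularity operators.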

Diff for: torchao/quantization/quant_api.py

+8
@@ -364,6 +364,14 @@ def int4_weight_only(group_size=128, inner_k_tiles=8):
     Applies uint4 weight-only asymmetric per-group quantization to linear layers, using
     "tensor_core_tiled" layout for speedup with tinygemm kernel
 
+    Note:
+        This targets the `tinygemm` int4mm kernel (`torch.ops.aten._weight_int4pack_mm`). The main differences
+        in the quantization algorithm compared to the more traditional type of integer quantization are the following:
+        1). zero_point is in the floating point domain instead of the integer domain (`zero_point_domain`=`ZeroPointDomain.FLOAT`)
+        2). floating point zero does not have to be exactly representable (`preserve_zero`=False in `choose_qparams_affine`)
+        Please follow the relevant code in `choose_qparams_affine`, `quantize_affine` and `dequantize_affine`
+        to learn how the quantization parameters are chosen and how the Tensor is quantized/dequantized for tinygemm
+
     Args:
         `group_size`: parameter for quantization, controls the granularity of quantization, smaller
          size is more fine grained, choices are [256, 128, 64, 32]
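
Note on the docstring hunk above: the two differences it lists can be illustrated with a scalar, pure-Python sketch for a single quantization group. This is not the torchao implementation. The function names are hypothetical, the formulas are an assumption about how a floating point `zero_point` with `preserve_zero=False` is typically computed, and details such as clamping `scale` to a minimum epsilon are left out.

```python
def choose_qparams_float_zero_point(values, quant_min=0, quant_max=15):
    """Hypothetical sketch: asymmetric qparams with a floating point zero_point."""
    min_val, max_val = min(values), max(values)
    # preserve_zero=False style: the range is NOT widened to include 0.0,
    # so real zero may fall outside the representable range.
    scale = (max_val - min_val) / (quant_max - quant_min)
    mid_point = (quant_max + quant_min + 1) / 2
    # Assumed convention: zero_point is a real value, offset from the group
    # minimum by mid_point quantization steps.
    zero_point = min_val + scale * mid_point
    return scale, zero_point


def quantize_float_zero_point(x, scale, zero_point, quant_min=0, quant_max=15):
    """Hypothetical sketch: map a real value to an unsigned 4-bit code."""
    mid_point = (quant_max + quant_min + 1) / 2
    min_val = zero_point - scale * mid_point  # recover the group minimum
    q = round((x - min_val) / scale)
    return max(quant_min, min(quant_max, q))  # clamp to [quant_min, quant_max]


group = [0.1, 0.4, 0.7, 1.6]  # illustrative weights, all positive
scale, zero_point = choose_qparams_float_zero_point(group)
codes = [quantize_float_zero_point(x, scale, zero_point) for x in group]
code_for_zero = quantize_float_zero_point(0.0, scale, zero_point)
```

In this example the group minimum `0.1` and the real value `0.0` both end up at code `0`, so `0.0` has no exact representation of its own. That is the trade-off the docstring refers to: the full 4-bit range is spent on the values that actually occur in the group.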

Diff for: torchao/quantization/quant_primitives.py

+1
@@ -324,6 +324,7 @@ def _dequantize_affine(
         dequant = dequant * scale
     else:
         assert zero_point_domain == ZeroPointDomain.FLOAT.name, f"Unexpected zero point domain: {zero_point_domain}"
+        # TODO: this seems to be a detail for tinygemm (converting from uint to int, probably need to refactor this)
         mid_point = (quant_max + quant_min + 1) / 2
         # This should allocate new memory and avoid input modification
         dequant = input - mid_point
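
Note on the hunk above: the `mid_point` subtraction that the new TODO comment points at can be read as a shift from unsigned codes to a signed range. The scalar sketch below uses the same `mid_point` expression as the hunk. The final "multiply by scale, add zero_point" step is not visible in the hunk and is an assumption here, and the function name is hypothetical.

```python
def dequantize_float_zero_point(q, scale, zero_point, quant_min=0, quant_max=15):
    """Hypothetical sketch of float-zero-point dequantization for one value."""
    # Same expression as in the hunk: 8.0 for the uint4 range [0, 15].
    mid_point = (quant_max + quant_min + 1) / 2
    # Shift the unsigned code into a range centered on zero (uint -> int).
    centered = q - mid_point
    # Assumed continuation: scale the centered code, then add the real-valued zero_point.
    return centered * scale + zero_point


# The uint4 codes 0, 8 and 15 become the centered values -8, 0 and 7.
centered_codes = [q - (15 + 0 + 1) / 2 for q in (0, 8, 15)]
```

Under this reading, the middle code (`8` for uint4) decodes exactly to `zero_point`, which is how a floating point `zero_point` differs from an integer one that is subtracted before scaling.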
