@@ -5,6 +5,7 @@ author: "AMD and Embedded LLM"
image: /assets/figures/ptpc/PTPC-tumbnail.png
thumbnail-img: /assets/figures/ptpc/PTPC-tumbnail.png
share-img: /assets/figures/ptpc/PTPC-tumbnail.png
+ math: true
---
**TL;DR**: vLLM on AMD ROCm now has better FP8 performance!
@@ -57,15 +58,15 @@ This insight led to a dual-granularity approach:
The illustration shows two quantization approaches:
**Tensor Dimensions (Both Methods):**
- - **X**: Input activation tensor (T×Ci)
- - **W**: Weight tensor (Ci×Co)
- - **T**: Token sequence length
- - **Ci/Co**: Input/output channels
- - **\***: Matrix multiplication
+ - **$X$**: Input activation tensor ($T \times C_i$)
+ - **$W$**: Weight tensor ($C_i \times C_o$)
+ - **$T$**: Token sequence length
+ - **$C_i/C_o$**: Input/output channels
+ - **$*$**: Matrix multiplication
**Scaling Factors:**
- - **Top (Per-Tensor)**: Single scalars ΔX[1] and ΔW[1] for entire tensors
- - **Bottom (PTPC)**: Vector ΔX[T×1] with one scale per token and ΔW[1×Co] with one scale per output channel
+ - **Top (Per-Tensor)**: Single scalars $\Delta_X[1]$ and $\Delta_W[1]$ for entire tensors
+ - **Bottom (PTPC)**: Vector $\Delta_X[T \times 1]$ with one scale per token and $\Delta_W[1 \times C_o]$ with one scale per output channel
This granular scaling approach allows PTPC-FP8 to achieve accuracy close to BF16 while maintaining the speed and memory benefits of 8-bit computation.
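
To make the dual-granularity scheme concrete, here is a minimal PyTorch sketch of the scaling math described above. It is illustrative only, not vLLM's fused ROCm kernel: the `ptpc_fp8_matmul` helper is a made-up name, and it uses `torch.float8_e4m3fn` for portability even though AMD MI300-class hardware natively uses the e4m3fnuz FP8 variant.

```python
import torch

# Max representable value of FP8 E4M3 (448 for torch.float8_e4m3fn).
# Note: ROCm hardware natively uses the e4m3fnuz variant, whose range differs.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def ptpc_fp8_matmul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x: [T, Ci] activations, w: [Ci, Co] weights -> [T, Co] output."""
    # Per-token activation scales: one scale per row of x, shape [T, 1].
    dx = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    # Per-output-channel weight scales: one scale per column of w, shape [1, Co].
    dw = w.abs().amax(dim=0, keepdim=True).clamp(min=1e-12) / FP8_MAX

    # Quantize both operands to FP8.
    x_fp8 = (x / dx).to(torch.float8_e4m3fn)
    w_fp8 = (w / dw).to(torch.float8_e4m3fn)

    # Accumulate the matmul in higher precision (emulated here in float32;
    # real kernels use the GPU's FP8 matrix instructions), then undo the
    # scaling with the outer product dx * dw, which broadcasts to [T, Co].
    y = x_fp8.to(torch.float32) @ w_fp8.to(torch.float32)
    return y * (dx * dw)

# Quick check against an unquantized reference.
x = torch.randn(16, 128)          # T=16 tokens, Ci=128 input channels
w = torch.randn(128, 256) * 0.02  # Ci=128 input, Co=256 output channels
ref = x @ w
out = ptpc_fp8_matmul(x, w)
print((out - ref).abs().max() / ref.abs().max())  # small relative error
```

Because each token and each output channel gets its own scale, one outlier token no longer forces the entire tensor onto a coarse grid, which is exactly where per-tensor FP8 loses accuracy.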