
Commit 656814d

add perf benchmarks to float8 training with rowwise scaling

1 parent 4780e10

File tree

1 file changed: +35 -0 lines changed

torchao/float8/README.md

@@ -269,3 +269,38 @@ python test/float8/test_fsdp2/test_fsdp2.py
# make sure to turn on torch.compile to get the best performance
./benchmarks/float8/bench_linear_float8.py -o ../tmp/test.txt --compile
```

### Training benchmarks

[Torchtitan](https://github.com/pytorch/torchtitan) was used to benchmark float8 training performance for both rowwise and tensorwise scaling. The training benchmarks were all run using:

- Single-node training on 8xH100 GPUs
- Batch size 1
- Sequence length 8192
- Steps 100
- `torch.compile`
- FSDP2

| Model     | Scaling     | Activation checkpointing | Median tokens/second | Peak Memory (GB) |
| --------- | ----------- | ------------------------ | -------------------- | ---------------- |
| Llama3-8b | none (bf16) | per op SAC               | 6019                 | 47.65            |
| Llama3-8b | tensorwise  | per op SAC               | 7190                 | 47.77            |
| Llama3-8b | rowwise     | per op SAC               | 6649                 | 47.79            |

In these benchmarks, tensorwise scaling achieved ~8% higher median tokens/second than rowwise scaling (7190 vs. 6649). However, it is important to note that rowwise scaling has been shown to yield improvements in training loss/accuracy due to reduced quantization error, particularly when training large models for many steps.
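
The intuition behind the quantization error claim: a single tensorwise scale must accommodate the largest outlier in the entire tensor, so one large value coarsens the quantization of everything else, while a rowwise scale only has to accommodate the largest value in its own row. Below is a minimal sketch of this effect using simulated float8 casts in plain PyTorch; the synthetic tensor and injected outlier are illustrative assumptions, not torchao's implementation:

```python
import torch

# Max normal value of torch.float8_e4m3fn.
E4M3_MAX = 448.0

def quant_dequant(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Scale into the fp8 range, cast to float8, then dequantize.
    x_fp8 = (x * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8.to(torch.float32) / scale

torch.manual_seed(0)
x = torch.randn(4096, 4096)
x[0, 0] = 1000.0  # a single outlier inflates the tensorwise amax

# Tensorwise: one scale shared by the whole tensor.
scale_t = E4M3_MAX / x.abs().max()
# Rowwise: one scale per row.
scale_r = E4M3_MAX / x.abs().amax(dim=-1, keepdim=True)

print("tensorwise MSE:", (quant_dequant(x, scale_t) - x).pow(2).mean().item())
print("rowwise MSE:   ", (quant_dequant(x, scale_r) - x).pow(2).mean().item())
```
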
**Reproducing training benchmarks**

To reproduce these benchmarks, follow these steps:

1. On a machine with 8 H100 GPUs, clone torchtitan and follow the local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation), including [downloading a tokenizer](https://github.com/pytorch/torchtitan?tab=readme-ov-file#downloading-a-tokenizer).
2. Install torchao following these [steps](https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation).
3. From the `torchao/float8/benchmarking/` directory, run the following commands to reproduce the benchmarks above:
   - bf16 + compile: `TORCHTITAN_ROOT=<path> ./float8_training_benchmark.sh`
   - float8 tensorwise: `TORCHTITAN_ROOT=<path> FLOAT8_RECIPE="tensorwise" ./float8_training_benchmark.sh`
   - float8 rowwise: `TORCHTITAN_ROOT=<path> FLOAT8_RECIPE="rowwise" ./float8_training_benchmark.sh`
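
The `FLOAT8_RECIPE` values above are torchao recipe names. To apply the same scaling recipes to your own model outside of torchtitan, the torchao API looks roughly like the sketch below (a sketch only, assuming a torchao build where `Float8LinearConfig.from_recipe_name` is available; see the usage sections earlier in this README for canonical examples):

```python
import torch
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Toy stand-in; in the benchmarks above the model is Llama3-8b inside torchtitan.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()

# Select a scaling recipe by name ("tensorwise" corresponds to the default config).
config = Float8LinearConfig.from_recipe_name("rowwise")

# Swap eligible nn.Linear modules for float8 training linears, then compile.
convert_to_float8_training(model, config=config)
model = torch.compile(model)
```
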
See the float8 training benchmarking [guide](benchmarking/README.md) for more details.
