
Commit 0e78699

add perf benchmarks to float8 training with rowwise scaling

Parent: 4780e10

1 file changed: +34 -0 lines changed

torchao/float8/README.md

@@ -269,3 +269,37 @@ python test/float8/test_fsdp2/test_fsdp2.py
# make sure to turn on torch.compile to get the best performance
./benchmarks/float8/bench_linear_float8.py -o ../tmp/test.txt --compile
```

### Training benchmarks
[Torchtitan](https://github.com/pytorch/torchtitan) was used to benchmark float8 training performance for both rowwise
and tensorwise scaling. The training benchmarks were all run using:

- Single-node training on 8xH100 GPUs
- Batch size 2
- Sequence length 8192
- Steps 100
- `torch.compile`
- FSDP2

| Model         | Scaling     | Activation checkpointing | Average tokens/second | Peak Memory (GB) |
| ------------- | ----------- | ------------------------ | --------------------- | ---------------- |
| Llama3-8b     | tensorwise  | per op SAC               | 3217.4                | 75.47            |
| Llama3-8b     | rowwise     | per op SAC               | 2838.1                | 75.55            |

In these benchmarks, tensorwise scaling achieved ~13% higher tokens/second than rowwise scaling. However, it is important to note
that rowwise scaling has been shown to yield improvements in training loss/accuracy due to reduced quantization error, particularly
when training large models for many steps.

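For intuition on where the quantization error difference comes from, below is a small, self-contained PyTorch sketch (illustrative only, not torchao code) comparing a single per-tensor scale with per-row scales on a toy weight matrix that has one outlier row:

```python
import torch

# Toy weight matrix with one outlier row; a single tensorwise scale must
# accommodate the outlier, wasting float8 dynamic range on the other rows.
w = torch.randn(4, 64)
w[0] *= 100.0

F8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quant_dequant(x, scale):
    # Scale into float8 range, cast to float8, then undo the scale.
    x_f8 = (x * scale).clamp(-F8_MAX, F8_MAX).to(torch.float8_e4m3fn)
    return x_f8.to(torch.float32) / scale

# Tensorwise: one scale for the whole tensor.
tensorwise_scale = F8_MAX / w.abs().max()
tensorwise_err = (quant_dequant(w, tensorwise_scale) - w).abs().mean()

# Rowwise: one scale per row, broadcast across the row.
rowwise_scale = F8_MAX / w.abs().amax(dim=1, keepdim=True)
rowwise_err = (quant_dequant(w, rowwise_scale) - w).abs().mean()

print(f"tensorwise mean abs error: {tensorwise_err:.5f}")
print(f"rowwise    mean abs error: {rowwise_err:.5f}")
```

The rowwise error is typically much lower on this toy example because each row gets the full float8 range to itself, which is the reduced quantization error referred to above.
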
**Reproducing training benchmarks**
To reproduce these benchmarks, you can follow these steps:

1. On a machine with 8 H100 GPUs, clone torchtitan and follow the local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation),
including [downloading a tokenizer](https://github.com/pytorch/torchtitan?tab=readme-ov-file#downloading-a-tokenizer).
2. Install torchao following these [steps](https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation).
3. From the torchtitan root directory, run the following commands to reproduce the benchmarks (a torchao-level sketch of the float8 conversion follows these steps):
   - Run float8 training with tensorwise scaling: `NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.steps=100 --training.batch_size=2 --training.compile --model.converters="float8"`
   - Run float8 training with rowwise scaling: `NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.steps=100 --training.batch_size=2 --training.compile --model.converters="float8" --float8.recipe_name="rowwise"`

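For reference, the `--model.converters="float8"` flag has torchtitan apply torchao's float8 conversion to the model's linear layers. A minimal sketch of doing the same conversion directly with torchao (assumes a recent torchao where `Float8LinearConfig.from_recipe_name` accepts a `"rowwise"` recipe name; the toy model and shapes are for illustration only):

```python
import torch
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Toy model; real use targets the linears of a transformer. Requires a GPU
# with float8 support (e.g. H100).
m = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 8192)).to(torch.bfloat16).to("cuda")

# Tensorwise scaling is the default Float8LinearConfig(); rowwise scaling is
# selected via a recipe name (assumed to mirror the --float8.recipe_name flag).
config = Float8LinearConfig.from_recipe_name("rowwise")
convert_to_float8_training(m, config=config)

# torch.compile is needed to fuse the scaling/casting and get good performance.
m = torch.compile(m)
```
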
Benchmark results were calculated by averaging the tokens/second over the first 100 training steps, excluding step 1, which includes
initialization overhead and is always much slower than steady-state training.
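
As a concrete (hypothetical) example of that calculation, assuming you have scraped one tokens/second value per step from the torchtitan logs (illustrative numbers only):

```python
# Hypothetical per-step tokens/second values from the training logs;
# step 1 is much slower because it includes compile and initialization overhead.
step_tps = [410.0, 3210.5, 3221.8, 3215.2, 3219.9]  # one entry per step

# Average over steady-state steps only (drop step 1).
avg_tps = sum(step_tps[1:]) / len(step_tps[1:])
print(f"average tokens/second (excluding step 1): {avg_tps:.1f}")
```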
