
Commit 0e78699

add perf benchmarks to float8 training with rowwise scaling

Parent: 4780e10

1 file changed: +34 -0 lines changed

torchao/float8/README.md

@@ -269,3 +269,37 @@ python test/float8/test_fsdp2/test_fsdp2.py
# make sure to turn on torch.compile to get the best performance
./benchmarks/float8/bench_linear_float8.py -o ../tmp/test.txt --compile
```

### Training benchmarks
[Torchtitan](https://github.com/pytorch/torchtitan) was used to benchmark float8 training performance for both rowwise
and tensorwise scaling. The training benchmarks were all run using:

- Single-node training on 8xH100 GPUs
- Batch size 2
- Sequence length 8192
- Steps 100
- `torch.compile`
- FSDP2

| Model         | Scaling     | Activation checkpointing | Average tokens/second | Peak Memory (GB) |
| ------------- | ----------- | ------------------------ | --------------------- | ---------------- |
| Llama3-8b     | tensorwise  | per op SAC               | 3217.4                | 75.47            |
| Llama3-8b     | rowwise     | per op SAC               | 2838.1                | 75.55            |

In these benchmarks, tensorwise scaling achieved ~13% higher tokens/second than rowwise scaling. However, it is important to note
that rowwise scaling has been shown to yield improvements in training loss/accuracy due to reduced quantization error, particularly
when training large models for many steps.

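For intuition on where the quantization error difference comes from, below is a small, self-contained PyTorch sketch (illustrative only, not torchao code) comparing a single per-tensor scale with per-row scales on a toy weight matrix that has one outlier row:

```python
import torch

# Toy weight matrix with one outlier row; a single tensorwise scale must
# accommodate the outlier, wasting float8 dynamic range on the other rows.
w = torch.randn(4, 64)
w[0] *= 100.0

F8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quant_dequant(x, scale):
    # Scale into float8 range, cast to float8, then undo the scale.
    x_f8 = (x * scale).clamp(-F8_MAX, F8_MAX).to(torch.float8_e4m3fn)
    return x_f8.to(torch.float32) / scale

# Tensorwise: one scale for the whole tensor.
tensorwise_scale = F8_MAX / w.abs().max()
tensorwise_err = (quant_dequant(w, tensorwise_scale) - w).abs().mean()

# Rowwise: one scale per row, broadcast across the row.
rowwise_scale = F8_MAX / w.abs().amax(dim=1, keepdim=True)
rowwise_err = (quant_dequant(w, rowwise_scale) - w).abs().mean()

print(f"tensorwise mean abs error: {tensorwise_err:.5f}")
print(f"rowwise    mean abs error: {rowwise_err:.5f}")
```

The rowwise error is typically much lower on this toy example because each row gets the full float8 range to itself, which is the reduced quantization error referred to above.
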
**Reproducing training benchmarks**
To reproduce these benchmarks, you can follow these steps:

1. On a machine with 8 H100 GPUs, clone torchtitan and follow the local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation),
including [downloading a tokenizer](https://github.com/pytorch/torchtitan?tab=readme-ov-file#downloading-a-tokenizer).
2. Install torchao following these [steps](https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation).
3. From the torchtitan root directory, run the following commands to reproduce the benchmarks (a torchao-level sketch of the float8 conversion follows these steps):
   - Run float8 training with tensorwise scaling: `NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.steps=100 --training.batch_size=2 --training.compile --model.converters="float8"`
   - Run float8 training with rowwise scaling: `NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.steps=100 --training.batch_size=2 --training.compile --model.converters="float8" --float8.recipe_name="rowwise"`

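For reference, the `--model.converters="float8"` flag has torchtitan apply torchao's float8 conversion to the model's linear layers. A minimal sketch of doing the same conversion directly with torchao (assumes a recent torchao where `Float8LinearConfig.from_recipe_name` accepts a `"rowwise"` recipe name; the toy model and shapes are for illustration only):

```python
import torch
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Toy model; real use targets the linears of a transformer. Requires a GPU
# with float8 support (e.g. H100).
m = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 8192)).to(torch.bfloat16).to("cuda")

# Tensorwise scaling is the default Float8LinearConfig(); rowwise scaling is
# selected via a recipe name (assumed to mirror the --float8.recipe_name flag).
config = Float8LinearConfig.from_recipe_name("rowwise")
convert_to_float8_training(m, config=config)

# torch.compile is needed to fuse the scaling/casting and get good performance.
m = torch.compile(m)
```
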
Benchmark results were calculated by averaging the tokens/second over the first 100 training steps, excluding step 1, which includes
initialization overhead and is always much slower than steady-state training.
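
As a concrete (hypothetical) example of that calculation, assuming you have scraped one tokens/second value per step from the torchtitan logs (illustrative numbers only):

```python
# Hypothetical per-step tokens/second values from the training logs;
# step 1 is much slower because it includes compile and initialization overhead.
step_tps = [410.0, 3210.5, 3221.8, 3215.2, 3219.9]  # one entry per step

# Average over steady-state steps only (drop step 1).
avg_tps = sum(step_tps[1:]) / len(step_tps[1:])
print(f"average tokens/second (excluding step 1): {avg_tps:.1f}")
```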
