| Model | Scaling | Activation checkpointing | Average tokens/sec | Peak memory (GB) |
| --------- | ---------- | ------------------------ | ------------------ | ---------------- |
| Llama3-8b | tensorwise | per op SAC | 3217.4 | 75.47 |
| Llama3-8b | rowwise | per op SAC | 2838.1 | 75.55 |

In these benchmarks, tensorwise scaling achieved ~13% higher tokens/second than rowwise scaling. However, it is important to note
that rowwise scaling has been shown to yield improvements in training loss/accuracy due to reduced quantization error, particularly
when training large models for many steps.

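To make the accuracy trade-off concrete: tensorwise scaling computes one quantization scale from the amax of the whole tensor, while rowwise scaling computes a scale per row, so a single outlier value only degrades the precision of its own row. The snippet below is a minimal, illustrative sketch in plain PyTorch (not torchao's actual implementation) comparing the two granularities.

```python
import torch

# Illustrative only: contrasts tensorwise vs rowwise float8 scaling granularity.
E4M3_MAX = 448.0  # max representable magnitude of torch.float8_e4m3fn

x = torch.randn(4096, 4096)
x[0, 0] = 1000.0  # a single outlier

# Tensorwise: one scale for the whole tensor, so the outlier shrinks the
# effective precision of every element.
scale_t = E4M3_MAX / x.abs().max().clamp(min=1e-12)
x_fp8_t = (x * scale_t).to(torch.float8_e4m3fn)

# Rowwise: one scale per row, so only the outlier's own row loses precision.
scale_r = E4M3_MAX / x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
x_fp8_r = (x * scale_r).to(torch.float8_e4m3fn)

# Compare reconstruction error after dequantization.
err_t = (x_fp8_t.to(torch.float32) / scale_t - x).abs().mean()
err_r = (x_fp8_r.to(torch.float32) / scale_r - x).abs().mean()
print(f"tensorwise mean abs error: {err_t.item():.5f}, rowwise: {err_r.item():.5f}")
```
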
**Reproducing training benchmarks**

To reproduce these benchmarks, you can follow these steps:

1. On a machine with 8 H100 GPUs, clone torchtitan and follow local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation),
including [downloading a tokenizer](https://github.com/pytorch/torchtitan?tab=readme-ov-file#downloading-a-tokenizer).
2. Install torchao following these [steps](https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation).
3. From the torchtitan root directory, you can run the following commands to reproduce the benchmarks (a sketch of the underlying torchao conversion API follows this list):
- Run float8 training with tensorwise scaling: `NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.steps=100 --training.batch_size=2 --training.compile --model.converters="float8"`
- Run float8 training with rowwise scaling: `NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.steps=100 --training.batch_size=2 --training.compile --model.converters="float8" --float8.recipe_name="rowwise"`
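
The `--model.converters="float8"` flag tells torchtitan to convert the model's `nn.Linear` layers for float8 training via torchao, and `--float8.recipe_name="rowwise"` selects the rowwise recipe. If you want to apply the same conversion in your own training script, a minimal sketch is below; it assumes the `convert_to_float8_training` and `Float8LinearConfig` entry points described in the torchao float8 README, so check that README for the exact, current API.

```python
import torch
import torch.nn as nn

# Sketch only: assumes the torchao.float8 APIs described in the torchao README
# (convert_to_float8_training, Float8LinearConfig); verify against your installed version.
from torchao.float8 import convert_to_float8_training, Float8LinearConfig

# A toy model; float8 training requires an H100-class (SM89+) GPU.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()

# Tensorwise (default) recipe: swap eligible nn.Linear modules in place.
convert_to_float8_training(model)

# Rowwise recipe (instead of the call above): build the config from the recipe name first.
# rowwise_config = Float8LinearConfig.from_recipe_name("rowwise")
# convert_to_float8_training(model, config=rowwise_config)
```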
Benchmark results were calculated by averaging the tokens/second over the first 100 training steps, excluding step 1, which includes
initialization overhead and is always much slower than steady-state training.
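
For example, the averaging step can be scripted as a small log post-processor. The log format below is hypothetical (one line per training step containing the step number and a tokens-per-second value); adjust the regex to match the metrics your torchtitan run actually prints.

```python
import re
import statistics

# Hypothetical log format: one line per step, e.g. "step: 42 ... tps: 3,210.5".
# Adjust this pattern to whatever your torchtitan run actually prints.
STEP_RE = re.compile(r"step:\s*(\d+).*?tps:\s*([\d,\.]+)")

def average_tokens_per_sec(log_path: str) -> float:
    """Average tokens/sec over all logged steps except step 1 (init overhead)."""
    samples = []
    with open(log_path) as f:
        for line in f:
            m = STEP_RE.search(line)
            if m is None:
                continue
            step = int(m.group(1))
            tps = float(m.group(2).replace(",", ""))
            if step > 1:  # step 1 includes initialization overhead; skip it
                samples.append(tps)
    return statistics.mean(samples)

if __name__ == "__main__":
    print(f"average tokens/sec: {average_tokens_per_sec('train.log'):.1f}")
```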