| Llama3-8b | tensorwise | per op SAC | 3217.4 | 75.47 |
| Llama3-8b | rowwise | per op SAC | 2838.1 | 75.55 |
As a rule of thumb, tensorwise scaling is more performant; however, rowwise scaling has been shown to yield improvements
in training loss/accuracy due to reduced quantization error, particularly when training large models for many steps.
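
For context, the two recipes correspond to different torchao float8 configurations. Below is a minimal sketch of how one might select tensorwise vs. rowwise scaling when converting a model with torchao directly (outside of torchtitan); the exact API surface (`convert_to_float8_training`, `Float8LinearConfig.from_recipe_name`) should be checked against the float8 docs for your installed torchao version:

```python
import torch
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Toy stand-in for a transformer; the nn.Linear layers are what get swapped
# to float8 linears by the conversion call below.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).to(torch.bfloat16).to("cuda")

# Tensorwise scaling: the default recipe, generally the fastest.
convert_to_float8_training(model)

# Rowwise scaling: select the recipe by name instead.
# config = Float8LinearConfig.from_recipe_name("rowwise")
# convert_to_float8_training(model, config=config)
```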
**Reproducing training benchmarks**
To reproduce these benchmarks, you can follow these steps:
1. On a machine with 8 H100 GPUs, clone torchtitan and follow local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation),
including [downloading a tokenizer](https://github.com/pytorch/torchtitan?tab=readme-ov-file#downloading-a-tokenizer).
2. Install torchao following these [steps](https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation).
3. From the torchtitan root directory, you can run the following commands to reproduce the benchmarks:
- Run float8 training with tensorwise scaling: `NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.steps=100 --training.batch_size=2 --training.compile --model.converters="float8"`
- Run float8 training with rowwise scaling: `NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.steps=100 --training.batch_size=2 --training.compile --model.converters="float8" --float8.recipe_name="rowwise"`
Benchmark results were calculated by averaging the tokens/second over the first 100 training steps, excluding step 1, which includes
initialization overhead and is always much slower than steady-state training.
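
For example, if you collect the per-step tokens/second values from the training logs into a list, the reported number is simply the mean over steps 2-100. A minimal sketch of that averaging convention (the variable names and placeholder values below are illustrative, not part of torchtitan):

```python
# tokens_per_sec[i] holds the tokens/second reported for training step i+1.
# How you extract these values from the torchtitan logs is up to you; this
# sketch only shows the averaging convention described above.
tokens_per_sec = [1200.0, 3210.5, 3225.1, 3218.7]  # placeholder values; use all 100 steps in practice

# Drop step 1 (initialization overhead) and average the remaining steps.
steady_state = tokens_per_sec[1:100]
avg_tokens_per_sec = sum(steady_state) / len(steady_state)
print(f"average tokens/sec (steps 2-100): {avg_tokens_per_sec:.1f}")
```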