| Llama3-8b | none (bf16) | per op SAC | 6019 | 47.65 |
| Llama3-8b | tensorwise | per op SAC | 7190 | 47.77 |
| Llama3-8b | rowwise | per op SAC | 6649 | 47.79 |
In these benchmarks, tensorwise scaling achieved ~8% higher tokens/second than rowwise scaling. However, rowwise scaling has been shown to yield improvements in training loss/accuracy due to reduced quantization error, particularly when training large models for many steps.

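For context on what the two recipes mean in code, below is a minimal sketch of enabling float8 training with torchao and choosing between tensorwise and rowwise scaling. It assumes torchao's `convert_to_float8_training` and `Float8LinearConfig.from_recipe_name` APIs and a toy model standing in for Llama3-8b; check the torchao float8 documentation for the exact names available in your installed version.

```python
import torch
import torch.nn as nn

from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Toy bf16 model; in the benchmarks above this is Llama3-8b from torchtitan.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).to(torch.bfloat16)

# Choose the scaling recipe (assumed recipe names):
#   "tensorwise" - one scale per tensor, highest throughput in the table above
#   "rowwise"    - one scale per row, lower quantization error
config = Float8LinearConfig.from_recipe_name("rowwise")

# Swap eligible nn.Linear modules for their float8 training variants in place.
# Actually training in float8 requires float8-capable hardware (e.g. H100).
convert_to_float8_training(model, config=config)
```
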
**Reproducing training benchmarks**
To reproduce these benchmarks, you can follow these steps:
1. On a machine with 8 H100 GPUs, clone torchtitan and follow local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation),
including [downloading a tokenizer](https://github.com/pytorch/torchtitan?tab=readme-ov-file#downloading-a-tokenizer).
2. Install torchao following these [steps](https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation).
3. From the `torchao/float8/benchmarking/` directory, you can run the following commands to reproduce the benchmarks above: