[float8] add perf benchmarks for float8 training with rowwise + tensorwise scaling #1793

Merged: 6 commits into pytorch:main from the fp8readme branch, Mar 12, 2025

Conversation

danielvegamyhre (Contributor)

Summary

  • Add float8 training performance benchmarks for rowwise + tensorwise scaling.
  • Add repro steps for these benchmarks.
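For context, here is a minimal sketch of how the two scaling recipes compared in these benchmarks can be selected with torchao. This assumes torchao's `Float8LinearConfig.from_recipe_name` / `convert_to_float8_training` API at the time of this PR; the toy model and shapes are placeholders, and the actual benchmarks run Llama3-8b through torchtitan.

```python
# Minimal sketch (assumption: torchao's float8 recipe API at the time of this PR)
# of selecting the two scaling recipes compared in these benchmarks.
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Placeholder model; the benchmarks use Llama3-8b via torchtitan.
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

# Tensorwise scaling: one scale per tensor (the default recipe).
tensorwise_cfg = Float8LinearConfig.from_recipe_name("tensorwise")

# Rowwise scaling: one scale per row, better at handling outliers.
rowwise_cfg = Float8LinearConfig.from_recipe_name("rowwise")

# Swap nn.Linear modules for their float8 training counterparts.
convert_to_float8_training(model, config=rowwise_cfg)
```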

@danielvegamyhre added the `topic: documentation` label on Feb 28, 2025
pytorch-bot bot commented Feb 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1793

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 4 Pending

As of commit 4610085 with merge base 711fa08:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the `CLA Signed` label on Feb 28, 2025
@danielvegamyhre changed the title from "Add perf benchmarks to float8 training with rowwise + tensorwise scaling" to "Add perf benchmarks for float8 training with rowwise + tensorwise scaling" on Feb 28, 2025
@danielvegamyhre force-pushed the fp8readme branch 2 times, most recently from a1e5143 to 0e78699, on February 28, 2025 07:16
- FSDP2

| Model | Scaling | Activation checkpointing | Average tokens/second | Peak Memory (GB) |
| ------------- | ----------- | ------------------------ | ------------------------- | ---------------- |
Contributor

we should include the baseline (bf16 + compile) here so it's clear what the speedup is from baseline

Contributor Author

Done

@danielvegamyhre changed the title from "Add perf benchmarks for float8 training with rowwise + tensorwise scaling" to "[float8] add perf benchmarks for float8 training with rowwise + tensorwise scaling" on Feb 28, 2025
@danielvegamyhre force-pushed the fp8readme branch 2 times, most recently from 656814d to c227ac7, on February 28, 2025 23:42
| Llama3-8b | tensorwise | per op SAC | 7190 | 47.77 |
| Llama3-8b | rowwise | per op SAC | 6649 | 47.79 |

In these benchmarks tensorwise scaling achieved ~8% higher tokens/second over rowwise scaling, and ~19.5% higher than the bf16 baseline.
Contributor

nit: how about

  • add column with "speedup over baseline" instead of only explaining it in a sentence
  • saying something like "rowwise scaling is better at handling outliers than tensorwise scaling, so these recipes are different points on the accuracy vs performance curve"
  • saying that speedups increase as M,K,N increase, and pointing to blogs such as https://pytorch.org/blog/training-using-float8-fsdp2/ where e2e speedups as high as 1.5x are quoted. This is just to clarify that the 1.2x shown here is not the max speedup - it's just the speedup given the benchmark setup.

Contributor Author

done, let me know what you think
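(Illustrative only, not code from this PR: the requested "speedup over baseline" column can be derived directly from the median tokens/second figures. The numbers below are taken from the updated benchmark table later in this thread.)

```python
# Derive the "speedup over baseline" column from median tokens/second.
# Figures copied from the updated benchmark table later in this thread.
baseline_tps = 6150  # Llama3-8b, bf16 + torch.compile + FSDP2, per op SAC

results = {
    "tensorwise (float8 all-gather)": 7689.5,
    "rowwise": 6768,
}

for recipe, tps in results.items():
    speedup_pct = (tps / baseline_tps - 1) * 100
    print(f"{recipe}: +{speedup_pct:.2f}% over baseline")
# tensorwise -> ~25.03%, rowwise -> ~10.05%, matching the table.
```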

@danielvegamyhre (Contributor Author)

Looks like test failure is related to #1799

vkuzo (Contributor) commented Mar 4, 2025

Just to confirm, for tensorwise scaling, I see that https://github.com/pytorch/ao/blob/main/benchmarks/float8/training/float8_training_benchmark.sh is using recipe lookup by name.

Unfortunately, for tensorwise scaling, this is correct but not optimal. We also need to enable the following flags to enable float8 all-gather with FSDP2, and those flags are currently not supported when using titan's recipe-string-to-recipe lookup:

# source: https://github.com/pytorch/torchtitan/blob/main/docs/float8.md
CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_train.sh --model.converters="float8" --float8.enable_fsdp_float8_all_gather --float8.precompute_float8_dynamic_scale_for_fsdp --float8.force_recompute_fp8_weight_in_bwd --training.compile

We have pytorch/torchtitan#901 to track making the torchtitan side of this better. Any chance we can update the tensorwise benchmark to include these flags, and also call out in the table that the tensorwise recipe has float8 all-gather for FSDP?
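For readers mapping these torchtitan flags back to torchao, here is a rough sketch of what they correspond to in the float8 API. The exact config fields and helper names are my assumption based on torchao at the time of this PR, not part of the PR itself.

```python
# Rough sketch (assumption: torchao's float8 API at the time of this PR) of what the
# three torchtitan flags above enable for tensorwise scaling.
import torch.nn as nn
from torchao.float8 import (
    Float8LinearConfig,
    convert_to_float8_training,
    precompute_float8_dynamic_scale_for_fsdp,
)

# --float8.enable_fsdp_float8_all_gather and --float8.force_recompute_fp8_weight_in_bwd
config = Float8LinearConfig(
    enable_fsdp_float8_all_gather=True,
    force_recompute_fp8_weight_in_bwd=True,
)

# Stand-in for the Llama3-8b model used in the benchmarks.
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))
convert_to_float8_training(model, config=config)

# --float8.precompute_float8_dynamic_scale_for_fsdp corresponds to calling this helper
# once per training step, after optimizer.step(), when the model is wrapped with FSDP2:
#   optimizer.step()
#   precompute_float8_dynamic_scale_for_fsdp(model)
```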

- `torch.compile`
- FSDP2

| Model | Scaling | Activation checkpointing | Peak Memory (GB) | Median tokens/second | Speedup over basline
Contributor

nit: baseline (typo)

@danielvegamyhre (Contributor Author)

> Any chance we can update the tensorwise benchmark to include these flags, and also call out in the table that the tensorwise recipe has float8 all-gather for FSDP?

Sure, but I thought enabling float8 all gather would make it less of a 1:1 comparison with rowwise? Or is the goal here just to showcase the peak achievable speedup using all optimal configs for each scaling strategy?

vkuzo (Contributor) commented Mar 4, 2025

> Or is the goal here just to showcase the peak achievable speedup using all optimal configs for each scaling strategy?

Yes, IMO that's what this should do. We should call out any features which are not implemented (or impossible to implement), but at the end we want the best speedup with each recipe, with the appropriate knobs turned on.

@danielvegamyhre (Contributor Author)

> Or is the goal here just to showcase the peak achievable speedup using all optimal configs for each scaling strategy?
>
> Yes, IMO that's what this should do. We should call out any features which are not implemented (or impossible to implement), but at the end we want the best speedup with each recipe, with the appropriate knobs turned on.

@vkuzo I managed to find a machine without perf regression and reran all benchmarks, using the optimal configs for tensorwise scaling this time, so this is ready for another look when you have a sec

| Model | Scaling | Activation checkpointing | Peak Memory (GB) | Median tokens/second | Speedup over baseline
| ------------- | --------------------------------- | ------------------------ | ------------------| -------------------- | ---------------------
| Llama3-8b | none (bf16) | per op SAC | 47.65 | 6150 | -
| Llama3-8b | tensorwise with optimal settings | per op SAC | 47.77 | 7689.5 | 25.03%
vkuzo (Contributor) commented Mar 12, 2025

how about something like "tensorwise with float8 all-gather" and "rowwise with bfloat16 all-gather"? "optimal settings" should be true for all the rows in this table, it's just that the actual settings change.

danielvegamyhre (Contributor Author) commented Mar 12, 2025

I thought about this but the thing is, tensorwise has 3 settings enabled, so listing all 3 would cause the column to become huge and make the table formatting clunky. So instead I listed the 3 settings in a bullet point below. What do you think?

Contributor

I think it's important to specify that float8 all-gather is used for tensorwise, and not important to list out the three specific settings used to enable that feature.


| Model | Scaling | Activation checkpointing | Peak Memory (GB) | Median tokens/second | Speedup over baseline
| ------------- | --------------------------------- | ------------------------ | ------------------| -------------------- | ---------------------
| Llama3-8b | none (bf16) | per op SAC | 47.65 | 6150 | -
Contributor

nit: bfloat16, to match how we spell dtypes in the rest of PyTorch?

[Torchtitan](https://github.com/pytorch/torchtitan) was used to benchmark float8 training performance, for both rowwise
and tensorwise scaling. The training benchmarks were all run using:

- Single-node training on 8xH100 GPUs
Contributor

can we add PyTorch version, torchtitan version, torchao version? Ideally the script could display them.
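One possible way the script could report those versions (a sketch, not code from this PR; whether torchtitan exposes a pip-installable version, and whether torchao exposes `__version__`, are assumptions):

```python
# Hypothetical addition to the benchmark setup: print the versions the reviewer asks
# for so that results are reproducible.
import importlib.metadata

import torch
import torchao

print(f"torch:      {torch.__version__}")
print(f"torchao:    {torchao.__version__}")
# Assumes torchtitan is pip-installed; otherwise read the version from the git checkout.
print(f"torchtitan: {importlib.metadata.version('torchtitan')}")
```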

| Llama3-8b | rowwise | per op SAC | 47.79 | 6768 | 10.05%

**Important notes**:
- Speedups increase as M,K,N (GEMM dimensions) increase. Speedups as high as 1.5x have been measured with larger shapes ([example](https://pytorch.org/blog/training-using-float8-fsdp2/)).
Contributor

nit: "E2e speedups as high as 1.5x..."

vkuzo (Contributor) left a comment

looks great, thank you!

@danielvegamyhre merged commit 8c81863 into pytorch:main on Mar 12, 2025
17 of 18 checks passed