
Add ReplaceAttentionMaskValue graph surgeon, Bert QNN EP example #1597

Merged: 4 commits into main from jambayk/mask-surgeon, Feb 5, 2025

Conversation

@jambayk (Contributor) commented Feb 5, 2025

Describe your changes

  • New graph surgeon that replaces the default attention mask value (transformers uses the minimum float value) with a different value. The minimum float value does not quantize well and leads to -inf when dequantized, so we replace it with a value like -10000, which is easier to quantize (see the sketch after this list).
  • Bert example for the QNN EP added. Accuracy is comparable to the float model.
    • Latency on CPU: ~900 ms
    • Latency on NPU (Snapdragon X Elite): ~8 ms
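For illustration, here is a minimal sketch of the idea behind such a surgeon as a standalone ONNX transform. The function name, traversal, and default replacement value are assumptions for demonstration, not the actual Olive implementation.

```python
# Sketch only: replace float-min attention mask fill values with a
# quantization-friendly value. Not the actual Olive pass.
import numpy as np
import onnx
from onnx import numpy_helper


def replace_attention_mask_value(model: onnx.ModelProto, new_value: float = -1e4) -> onnx.ModelProto:
    float_min = np.finfo(np.float32).min

    # Patch initializers (e.g. an additive mask fill constant).
    for init in model.graph.initializer:
        arr = numpy_helper.to_array(init)
        if arr.dtype == np.float32 and np.any(arr == float_min):
            patched = np.where(arr == float_min, np.float32(new_value), arr).astype(np.float32)
            init.CopyFrom(numpy_helper.from_array(patched, init.name))

    # Patch Constant nodes that feed the mask path (e.g. into Where/Add).
    for node in model.graph.node:
        if node.op_type != "Constant":
            continue
        for attr in node.attribute:
            if attr.name == "value" and attr.t.data_type == onnx.TensorProto.FLOAT:
                arr = numpy_helper.to_array(attr.t)
                if np.any(arr == float_min):
                    patched = np.where(arr == float_min, np.float32(new_value), arr).astype(np.float32)
                    attr.t.CopyFrom(numpy_helper.from_array(patched, attr.t.name))

    return model


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    model = onnx.load("model.onnx")
    onnx.save(replace_attention_mask_value(model), "model_patched.onnx")
```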

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running `lintrunner -a`
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
  • Is this PR including examples changes? If yes, please remember to update example documentation in a follow-up PR.

(Optional) Issue link

jambayk merged commit e2ae2e0 into main on Feb 5, 2025
24 checks passed
jambayk deleted the jambayk/mask-surgeon branch on February 5, 2025 at 21:43