
[FR]: prepare_qnn_config should not overwrite op_types_to_quantize if explicitly set #1552

xieofxie opened this issue Jan 15, 2025 · 12 comments

@xieofxie

Proposal Summary

I suggest we add the following logic in prepare_qnn_config, right after this line:

run_config = {k: v for k, v in inspect.getmembers(qnn_config) if not k.startswith("_")}

so that an explicitly set op_types_to_quantize does not get overwritten:

if config["op_types_to_quantize"]:
    run_config["op_types_to_quantize"] = config["op_types_to_quantize"]

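For context, here is a rough sketch of where this sits (paraphrased rather than the actual Olive source; model_path, data_reader, and config are placeholders for values the pass already has in scope):

```python
import inspect

# get_qnn_qdq_config comes from onnxruntime's quantization tooling
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config

# placeholders standing in for what the Olive pass already has
model_path = "model.onnx"
data_reader = None  # a CalibrationDataReader in the real pass
config = {"op_types_to_quantize": ["Conv", "MatMul"]}  # example of an explicit user setting

# Olive builds its run config from the QNN QDQ config returned by onnxruntime
qnn_config = get_qnn_qdq_config(model_path, data_reader)
run_config = {k: v for k, v in inspect.getmembers(qnn_config) if not k.startswith("_")}

# proposed addition: an explicitly configured op type list wins over the one
# that get_qnn_qdq_config picked for the whole model
if config["op_types_to_quantize"]:
    run_config["op_types_to_quantize"] = config["op_types_to_quantize"]
```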
What component(s) does this request affect?

  • OliveModels
  • OliveSystems
  • OliveEvaluator
  • Metrics
  • Engine
  • Passes
  • Other
xieofxie added the enhancement (New feature or request) label on Jan 15, 2025
@jambayk
Contributor

jambayk commented Jan 21, 2025

Hi, do you have an example use case where you want to only quantize some ops?

When using QNN EP, the entire model needs to be fully quantized so that the whole model runs in a single QNN graph. Otherwise, I think it gets split into multiple subgraphs or becomes incompatible with QNN EP. Is your intention to split the model on purpose?

Regardless, I think it might be better to ask for this feature in onnxruntime's get_qnn_qdq_config https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/execution_providers/qnn/quant_config.py#L43. We can then update the call to it accordingly in Olive. This way, we keep the QNN EP related logic in the ORT quantization code instead of doing the op_types_to_quantize overwrite in Olive, especially since we don't know whether this should be allowed or not.

@xieofxie
Author

I am quantizing the unet of stable diffusion 2.1, and I found that it only works well when I don't quantize the Constant ops. From the log, it looks like everything is running on QNN EP:

2025-01-22 14:25:50.2318910 [I:onnxruntime:, qnn_execution_provider.cc:788 onnxruntime::QNNExecutionProvider::GetCapability] Number of partitions supported by QNN EP: 1, number of nodes in the graph: 6201, number of nodes supported by QNN: 6201
2025-01-22 14:25:53.3775343 [I:onnxruntime:, qnn_execution_provider.cc:788 onnxruntime::QNNExecutionProvider::GetCapability] Number of partitions supported by QNN EP: 1, number of nodes in the graph: 7101, number of nodes supported by QNN: 7101

From my understanding of https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/SupportedOps.html, we could run unquantized ops on QNN EP in fp16 as long as they are supported? So it should be acceptable not to quantize everything.
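For reference, this is roughly how one would ask QNN EP to run the non-quantized parts in fp16 at session creation time. The provider option names below are from my reading of the ORT QNN EP docs and may differ across ORT versions; the paths are placeholders:

```python
import onnxruntime as ort

# placeholder model path; "backend_path" points at the HTP backend library
session = ort.InferenceSession(
    "model_qdq.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[{
        "backend_path": "QnnHtp.dll",
        "enable_htp_fp16_precision": "1",  # run float ops in fp16 on the HTP
    }],
)
```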

Yes, I will create a PR there.

@jambayk
Contributor

jambayk commented Jan 22, 2025

I see. Thanks for the context.

For the time being, what if you apply the [peephole optimizer](https://microsoft.github.io/Olive/how-to/configure-workflows/model-opt-and-transform/onnx.html#peeophole-optimizer) (needs onnxoptimizer installed) to do constant folding before the quantization step? This step should convert the constants into initializers, so they won't get separately quantized.

It looks like what's happening is that the Constant tensors get weight-quantized, but the outputs of the Constants also get activation-quantized.
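If you want to try that constant folding outside of Olive, a minimal sketch with onnxoptimizer directly might look like this (the pass name comes from onnxoptimizer's built-in pass list; file names are placeholders):

```python
import onnx
import onnxoptimizer  # pip install onnxoptimizer

model = onnx.load("unet.onnx")  # placeholder path

# turn Constant nodes into initializers so the quantizer only weight-quantizes them
# instead of also inserting activation QDQ nodes on their outputs
folded = onnxoptimizer.optimize(model, ["extract_constant_to_initializer"])

# very large models (>2GB) would need to be saved with external data instead
onnx.save(folded, "unet_folded.onnx")
```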

@xieofxie
Author

Looks like it's not working. I compared the number of initializers with and without peephole; both are 4930.

@jambayk
Contributor

jambayk commented Jan 23, 2025

Thanks for checking. By the way, did you have onnxoptimizer installed? My response originally said onnxscript instead of onnxoptimizer; I just wanted to make sure you saw the updated response.

@jambayk
Contributor

jambayk commented Jan 23, 2025

Could you try another thing: in your quantization pass config, please set "quant_preprocess": true.
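For anyone following along, a minimal sketch of what that pass entry could look like (shown as a Python dict; the pass type name is my assumption, only quant_preprocess comes from this thread):

```python
# corresponds to an entry in the "passes" section of the Olive workflow JSON
quantization_pass = {
    "type": "OnnxStaticQuantization",  # assumed pass type; use whatever pass you already have
    "quant_preprocess": True,          # run ORT's quantization pre-processing first
}
```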

@xieofxie
Author

Oh, it is working now. Not sure why the doc says it defaults to true but I have to set it explicitly in the JSON. Once the model is preprocessed, there are no Constant ops any more, so peephole is not needed.

At least this is not needed for my case. So it's up to the onnxruntime team to decide whether this:

> From my understanding of https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/SupportedOps.html, we could run unquantized ops on QNN EP in fp16 as long as they are supported? So it should be acceptable not to quantize everything.

makes sense.

@jambayk
Contributor

jambayk commented Jan 23, 2025

Great! Thanks for checking.

I set the default to False when the EP is QNN EP because I thought a preprocessed model would not be compatible with QNN EP:

config["quant_preprocess"].default_value = False

But I have recently found that preprocessing is needed most of the time. Along with constant folding, the quantizer also needs a shape-inferred model, which the preprocessing provides as well.

I am planning to remove this False default for QNN EPs.
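For reference, the preprocessing that quant_preprocess toggles corresponds roughly to ONNX Runtime's quant_pre_process helper, which runs symbolic shape inference and ORT graph optimization (including constant folding) before quantization. A sketch of calling it directly, with placeholder paths:

```python
from onnxruntime.quantization.shape_inference import quant_pre_process

# writes a shape-inferred, optimized copy of the model that the static quantizer can handle
quant_pre_process("unet.onnx", "unet_preprocessed.onnx")
```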

jambayk added a commit that referenced this issue Jan 24, 2025
…nn ep (#1565)

## Describe your changes
`quant_preprocess` was originally set to `False` by default because I
thought it might not be compatible with QNN EP. However, it is needed
for most models for the quantization to work properly:
- The quantizer needs a shape inferred model to quantize some tensors.
- Model quality is bad if constants are not made into initializers (see
#1552).

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this
change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to
update [example
documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md)
in a follow-up PR.

## (Optional) Issue link
@xieofxie
Author

I found that I could not use Add and Softmax in the text encoder of stable diffusion. Do you have any idea about this? @jambayk

@jambayk
Contributor

jambayk commented Jan 24, 2025

Sorry, I am not familiar with this problem. Do you mean that Add and Softmax in the text encoder don't work with QNN EP?

@xieofxie
Author

I mean the accuracy drops when I test the QDQ model on CPU. I am wondering if there is some optimization technique to make them more accurate? Like for the Constant op, where we could use the preprocessing to turn them into initializers.

@jambayk
Contributor

jambayk commented Jan 27, 2025

I see. I don't think the constant->initializer change can be applied to other ops. The Constant fix worked because it avoids quantizing the same tensor twice, but for other operators like Add, we have to quantize the op if we want to run the model on the NPU as a single graph. You could manually set tensor overrides as described in https://github.com/microsoft/onnxruntime/blob/8db97a68f2629aa32a3ab318e741555f72151aca/onnxruntime/python/tools/quantization/quantize.py#L179, but that might be too complicated.
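For completeness, a rough sketch of what those tensor overrides look like when calling the ORT quantizer directly (the tensor name and quant types are made up for illustration; in Olive this would need to be plumbed through the pass's extra options):

```python
from onnxruntime.quantization import QuantType, quantize_static

# hypothetical: keep one problematic Softmax output in 16-bit activations instead of 8-bit
extra_options = {
    "TensorQuantOverrides": {
        "/text_model/encoder/layers.0/self_attn/Softmax_output_0": [  # made-up tensor name
            {"quant_type": QuantType.QUInt16}
        ]
    }
}

quantize_static(
    model_input="text_encoder.onnx",      # placeholder paths
    model_output="text_encoder_qdq.onnx",
    calibration_data_reader=data_reader,  # your CalibrationDataReader (not defined here)
    extra_options=extra_options,
)
```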
