[FR]: prepare_qnn_config should not overwrite op_types_to_quantize if explicitly set #1552
Hi, do you have an example use case where you want to quantize only some ops? When using QNN EP, the entire model needs to be fully quantized so that the whole model runs in a single QNN graph. Otherwise, I think it gets split into multiple subgraphs or becomes incompatible with QNN EP. Is your intention to split the model on purpose? Regardless, I think it might be better to ask for this feature in onnxruntime.
I am quantizing the UNet of Stable Diffusion 2.1, and I found that it only works fine when I don't quantize the Constant ops. From the log, it looks like everything is running on QNN EP.
From what I understand from https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/SupportedOps.html, we can run unquantized ops in fp16 on QNN EP as long as they are supported? So it should be acceptable that we do not quantize everything. Yes, I will create a PR there.
I see, thanks for the context. For the time being, what if you apply the [peephole optimizer](https://microsoft.github.io/Olive/how-to/configure-workflows/model-opt-and-transform/onnx.html#peeophole-optimizer) (needs `onnxoptimizer` installed)? Looks like what's happening is that the constant tensors get weight quantized, but the outputs of the Constant nodes get activation quantized again.
Looks like it's not working. I compared the number of initializers with and without peephole; they are both 4930.
Thanks for checking. By the way, did you have onnxoptimizer installed? My response originally said onnxscript instead of onnxoptimizer; I just wanted to make sure you saw the updated response.
Could you try another thing: in your quantization pass config, please set `quant_preprocess` to `true`.
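For reference, here is a minimal sketch of the pass entry I mean, written as a Python dict (equivalent to the JSON form). The pass type and field names are taken from this thread and may differ from the exact Olive schema for your version:

```python
# Minimal sketch of the quantization pass entry in an Olive workflow config.
# "OnnxStaticQuantization" and the field names are assumed from this thread;
# check the Olive pass documentation for the exact schema.
quantization_pass = {
    "type": "OnnxStaticQuantization",
    "quant_preprocess": True,    # run ORT's quant_pre_process before quantizing
    "prepare_qnn_config": True,  # generate a QNN-EP-friendly quantization config
}
```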
Oh, it is working now. Not sure why the doc says it defaults to true when I have to set it explicitly in the JSON. If the model is preprocessed, then there are no Constant ops any more, so peephole is not needed, at least not for my case. So it depends on the onnxruntime team whether this feature request makes sense.
Great! Thanks for checking. I set the default to False when the EP is QNN EP because I thought a preprocessed model would not be compatible with QNN EP (Olive/olive/passes/onnx/quantization.py, line 611 in 28beb3f).
But I have recently found that preprocessing is needed most of the time. Along with the constant folding, the quantizer also needs a shape-inferred model, which is also produced by the preprocessing. I am planning to remove this False default for QNN EPs.
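For context, this is roughly what that preprocessing step runs; a minimal sketch using onnxruntime's helper directly, with placeholder paths:

```python
# Sketch of onnxruntime's quantization pre-processing: ONNX-level optimization
# (which folds Constant nodes into initializers) plus shape inference.
from onnxruntime.quantization.shape_inference import quant_pre_process

quant_pre_process(
    "unet.onnx",               # input model (placeholder path)
    "unet_preprocessed.onnx",  # output model (placeholder path)
    skip_optimization=False,   # keep the optimization that folds Constant nodes
    skip_onnx_shape=False,     # run ONNX shape inference
    skip_symbolic_shape=False, # run symbolic shape inference
)
```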
…nn ep (#1565)

## Describe your changes

`quant_preprocess` was originally set to `False` by default because I thought it might not be compatible with QNN EP. However, it is needed for most models for the quantization to work properly:

- The quantizer needs a shape-inferred model to quantize some tensors.
- Model quality is bad if constants are not made into initializers (see #1552).

## Checklist before requesting a review

- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to update [example documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md) in a follow-up PR.

## (Optional) Issue link
I found that I could not use Add and Softmax in the text encoder of Stable Diffusion. Do you have any idea about this? @jambayk
Sorry, I am not familiar with this problem. Do you mean that Add and Softmax in the text encoder don't work with QNN EP?
I mean the accuracy drops when I test the QDQ model on CPU. I am wondering if there is some optimization technique to make them more accurate? For example, for the Constant op, we could use the preprocessing to make constants into initializers.
I see. I don't think the constant-to-initializer change can be applied to other ops. The constant fix worked because we were avoiding quantizing the same tensor twice, but for other operators like Add, we have to quantize the op if we want to run the model on the NPU as a single graph. You could manually set tensor quantization overrides as described in https://github.com/microsoft/onnxruntime/blob/8db97a68f2629aa32a3ab318e741555f72151aca/onnxruntime/python/tools/quantization/quantize.py#L179, but that might be too complicated.
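For example, a hypothetical sketch of such an override via `extra_options["TensorQuantOverrides"]` in ORT static quantization; the tensor name and paths are placeholders, and the calibration data reader is assumed to be defined elsewhere (see the linked docstring for the accepted keys):

```python
# Hypothetical sketch: force a sensitive activation to 16-bit via per-tensor
# overrides. Tensor name and paths are placeholders; `my_data_reader` is an
# assumed CalibrationDataReader defined elsewhere.
from onnxruntime.quantization import QuantType, quantize_static

quantize_static(
    model_input="text_encoder_preprocessed.onnx",  # placeholder path
    model_output="text_encoder_qdq.onnx",          # placeholder path
    calibration_data_reader=my_data_reader,        # assumed reader
    extra_options={
        "TensorQuantOverrides": {
            # e.g. the output of a Softmax node that loses accuracy at 8 bits
            "softmax_output_0": [{"quant_type": QuantType.QUInt16}],
        }
    },
)
```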
Proposal Summary

I suggest we add logic after Olive/olive/passes/onnx/quantization.py, line 488 in c1e1365, so that `op_types_to_quantize` is not overwritten by `prepare_qnn_config` when it is explicitly set. A sketch of what this could look like is below.
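A hypothetical sketch of the requested behavior; variable names are illustrative rather than the actual Olive code, and `get_qnn_qdq_config` is the onnxruntime helper used to prepare QNN quantization configs:

```python
# Hypothetical sketch: only let the QNN config preparation fill in
# op_types_to_quantize when the user has not set it explicitly.
# `run_config`, `model_path`, and `calibration_data_reader` are assumed to
# exist in the surrounding pass code; the real names may differ.
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config

qnn_config = get_qnn_qdq_config(model_path, calibration_data_reader)

user_op_types = run_config.get("op_types_to_quantize")
if user_op_types:
    # respect the explicitly configured subset instead of the full set
    # derived from the model by get_qnn_qdq_config
    qnn_config.op_types_to_quantize = user_op_types
```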
What component(s) does this request affect?