
Add 4-bit channel-wise quantization capability for MatMulNBits op #631

Draft
wants to merge 2 commits into base: ovep-develop
Conversation

@bopeng1234 commented Mar 31, 2025

Description

Add 4-bit channel-wise quantization capability for the MatMulNBits op for the Phi-3 model; it improves tokens per second (TPS) on the Intel NPU.

JIRA - https://jira.devtools.intel.com/browse/EISW-163602

Motivation and Context

As Intel's NPU support for LLMs shows (https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/text_generation#npu-support),

if we want to run a quantized ONNX model such as Phi-3 on the Intel NPU, the quantized model needs to meet two requirements:

  1. symmetric quantization (zero point = 0)
  2. channel-wise quantization (block_size = K)

This PR therefore enables symmetric, channel-wise quantization.

Weights are quantized to int4 in the range [-8, 7]. We tested it with onnxruntime-genai changes (we also created a PR against onnxruntime-genai to support the extra argument: microsoft/onnxruntime-genai#1362).
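
For illustration only (this is not the code in this PR), here is a minimal NumPy sketch of what symmetric channel-wise int4 quantization of a (K, N) MatMulNBits weight means: one block spans the whole K dimension (block_size = K), so there is a single scale per output channel, the zero point is always 0, and the quantized values are clamped to [-8, 7]:

```python
import numpy as np

def quantize_int4_channel_wise(W):
    """Symmetric channel-wise int4 quantization of a (K, N) weight matrix.

    block_size = K: one quantization block covers the whole K dimension,
    so there is exactly one scale per output channel and zero point = 0.
    """
    max_abs = np.max(np.abs(W), axis=0)                   # (N,) per-channel max magnitude
    scales = np.where(max_abs == 0, 1.0, max_abs / 7.0)   # avoid divide-by-zero for all-zero channels
    q = np.clip(np.round(W / scales), -8, 7).astype(np.int8)
    return q, scales                                      # zero point is implicitly 0

def dequantize(q, scales):
    return q.astype(np.float32) * scales

W = np.random.randn(3072, 3072).astype(np.float32)
q, scales = quantize_int4_channel_wise(W)
print("max reconstruction error:", np.abs(W - dequantize(q, scales)).max())
```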

command:
python -m onnxruntime_genai.models.builder -o E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified -p int4 -e cpu -i E:\download\huggingface\Phi-3-mini-4k-instruct --extra_options use_channel_wised_quantization=1
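
As a quick sanity check (not part of this PR; the model path below is illustrative), the resulting model can be loaded through the OpenVINO execution provider targeting the NPU:

```python
import onnxruntime as ort

# Illustrative path; the builder command above writes the model into the -o directory.
model_path = r"E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified\model.onnx"

sess = ort.InferenceSession(
    model_path,
    providers=["OpenVINOExecutionProvider"],
    provider_options=[{"device_type": "NPU"}],
)
print([i.name for i in sess.get_inputs()])
```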

Normally, without the channel-wise quantized model, Phi-3 with NPUW runs at about 4000 ms per token for the KV-cache model.
With this PR applied, Phi-3 with NPUW runs at about 150 ms per token, a speedup of more than 20x.

@sfatimar commented Apr 1, 2025

These file changes need to be sent directly to Microsoft. If you are from Intel, please contact @ankitm3k.

@ankitm3k commented Apr 4, 2025

@bopeng1234 Please file a JIRA with all your findings and rebase this branch onto the source branch ASAP.

Kindly confirm that the change enabling CW quantization is also valid for quant_format=QuantFormat.QDQ. Kindly share the recipe to create an int4 quantized model with me, here or in the JIRA, as a reproducer.

@bopeng1234 (Author) commented

@ankitm3k, I have already filed JIRA EISW-163602 and rebased onto the source branch.

I added the QDQ format; the command to create an NPU-friendly int4 CW-quantized ONNX model is also attached in the JIRA.
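
For context only (the actual recipe is the one attached in the JIRA, and channel-wise block_size = K support is exactly what this PR and microsoft/onnxruntime-genai#1362 add), a rough sketch of how onnxruntime's MatMul4BitsQuantizer is normally driven for symmetric int4 weight-only quantization; the quant_format argument may not exist in older onnxruntime releases:

```python
import onnx
from onnxruntime.quantization import QuantFormat
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("phi3-fp32.onnx")  # placeholder input model path

quantizer = MatMul4BitsQuantizer(
    model,
    block_size=128,                 # stock block-wise setting; this PR targets block_size = K
    is_symmetric=True,              # zero point = 0, as the NPU requires
    quant_format=QuantFormat.QDQ,   # availability depends on the installed onnxruntime version
)
quantizer.process()
quantizer.model.save_model_to_file("phi3-int4-qdq.onnx", use_external_data_format=True)
```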

@bopeng1234 marked this pull request as draft on April 8, 2025, 01:36