Add 4-bit channel-wise quantization capability for MatMulNBits op #631
Description
Add 4-bit channel-wise quantization capability to the MatMulNBits op for the Phi-3 model; this improves tokens per second (TPS) on the Intel NPU.
JIRA - https://jira.devtools.intel.com/browse/EISW-163602
Motivation and Context
As Intel's NPU support for LLMs shows (https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/text_generation#npu-support), to run an ONNX quantized model such as Phi-3 on an Intel NPU, the quantized model needs to meet two requirements:

- the weights must be quantized channel-wise (one scale per output channel, rather than per group), and
- the quantization must be symmetric.

So this PR enables channel-wise, symmetric quantization to int4 in the range [-8, 7]; a sketch of the scheme follows below. We tested it with onnxruntime-genai changes (we created a PR to onnxruntime-genai as well to support this extra argument, microsoft/onnxruntime-genai#1362).
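For clarity, here is a minimal NumPy sketch of what channel-wise symmetric int4 quantization amounts to. This is an illustration of the scheme, not the code in this PR: the function names are hypothetical, and details such as the scale rule (divide by 8 and clip vs. divide by 7) and the nibble ordering when packing may differ from the actual implementation.

```python
import numpy as np

def quantize_channelwise_sym_int4(w: np.ndarray):
    """Symmetric per-channel int4 quantization of a 2-D weight matrix.

    w: float32 weights of shape (K, N); one scale per output column (channel).
    Returns (q, scales) with q in [-8, 7].
    """
    # One scale per channel: map the largest magnitude onto the int4 range.
    amax = np.max(np.abs(w), axis=0)               # shape (N,)
    scales = amax / 8.0                            # assumption: divide by 8, clip below
    scales = np.where(scales == 0.0, 1.0, scales)  # avoid division by zero
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float32)

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack signed int4 values into uint8, two per byte, with an offset of 8
    (the zero point MatMulNBits assumes for symmetric 4-bit weights)."""
    u = (q.astype(np.int16) + 8).astype(np.uint8).reshape(-1)
    if u.size % 2:                                 # pad to an even count
        u = np.append(u, np.uint8(8))
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

# Dequantized weights are scales * q per channel; the reconstruction error
# is bounded by half a quantization step per element.
w = np.random.randn(16, 4).astype(np.float32)
q, scales = quantize_channelwise_sym_int4(w)
w_hat = q.astype(np.float32) * scales
print(np.max(np.abs(w - w_hat)))
```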
Command:

```
python -m onnxruntime_genai.models.builder -o E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified -p int4 -e cpu -i E:\download\huggingface\Phi-3-mini-4k-instruct --extra_options use_channel_wised_quantization=1
```
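One way to sanity-check the builder output (a sketch, assuming the builder writes `model.onnx` into the output directory, and assuming channel-wise quantization shows up as a single quantization block spanning the whole K dimension, i.e. `block_size == K`, on each MatMulNBits node):

```python
import onnx

# Hypothetical path; adjust to wherever the builder wrote the model.
model = onnx.load(r"E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified\model.onnx")

for node in model.graph.node:
    if node.op_type == "MatMulNBits":
        attrs = {a.name: onnx.helper.get_attribute_value(a) for a in node.attribute}
        # Channel-wise quantization should show block_size equal to K.
        print(node.name, "bits =", attrs.get("bits"),
              "K =", attrs.get("K"), "block_size =", attrs.get("block_size"))
```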
Normally, without the channel-wise quantized model, Phi-3 with NPUW runs at about 4000 ms per token for the KV-cache model.
With this PR applied, Phi-3 with NPUW runs at about 150 ms per token, a speedup of more than 20x.