QuantizedLinearNotImplementedError when inference with Int8DynamicActivationInt4WeightConfig #1909
Comments
Int8DynamicActivationInt4Weight is supposed to be lowered to ExecuTorch to get a speedup, but we also support a CUTLASS-based kernel; see ao/torchao/quantization/quant_api.py, line 589 at commit 6b76adb.
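For context, a minimal sketch of applying the default config with the current `quantize_` API; the toy model is illustrative, and the import locations assume a recent torchao where the config classes are exported from `torchao.quantization`:

```python
import torch
from torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig

# toy model for illustration; any nn.Module with Linear layers works
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

# default layout: intended to be lowered to ExecuTorch; on plain CUDA
# eager/compile it falls back to dequantize + bf16 matmul
quantize_(model, Int8DynamicActivationInt4WeightConfig())
```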
@jerryzh168 I made that change, but it still hits the exception.
@goldhuang can you paste the error message?
@jerryzh168
@jerryzh168 It's still not working after I upgraded to a newer torch. Int8DynamicActivationInt4WeightConfig is basically not working: it actually runs in bf16 (the original dtype of the model and input) with an extra dequantize().
@jerryzh168
I see, then yeah.
I got this after changing that line. Could you guide me on the error above?
Oh sorry, here is how you use the CUTLASS s8s4 kernel: ao/test/dtypes/test_affine_quantized.py, line 70 at commit d456ea1.
@jerryzh168 Thanks for sharing the test code. But the code is obsolete and cannot run with the latest torchao code because of the redesign of quantize_().
I see, cc @alexsamardzic: is there any other flag we need to enable to include the CUTLASS kernel in the build? I can test out the API a bit later. Yeah, the default layout variant for int8 activation + int4 weight is only for ExecuTorch.
If you are referring to the default layout, then yeah, it's just for ET; but the CUTLASS one I think has some speedup on A100: #880 (and float8 is probably better on H100).
In order to use the CUTLASS-based kernel for W4A8, the configuration has to be changed and the corresponding imports have to be added; see the sketch below.
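Since the exact snippet did not survive in this thread, here is a rough reconstruction of what the config and imports could look like; the layout class name, its import path, and the keyword arguments are my assumptions and may differ between torchao versions:

```python
import torch
from torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig
# assumed import path for the CUTLASS packed-int4 layout
from torchao.dtypes import CutlassInt4PackedLayout

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

# select the CUTLASS W4A8 path instead of the default (ExecuTorch-oriented) layout;
# note that this kernel does not support group quantization (see the comment below)
quantize_(model, Int8DynamicActivationInt4WeightConfig(layout=CutlassInt4PackedLayout()))
```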
As, to my knowledge, there is no support either in Triton for S8/S4 GEMM, or in the CUTLASS auto-tuning "back-end" for Inductor, the same CUTLASS-based W4A8 CUDA kernel from torchao should be executed for the above config in both eager and compiled mode. The expected speed-up over the non-quantized case is not particularly significant; I'm at the moment looking into some improvements. Also, this CUTLASS-based kernel has some caveats, for example group quantization is not supported. As @jerryzh168 mentioned above: this kernel is really just for the Ampere generation of GPUs (the kernel will be compiled if ...).
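As a side note, one way to verify you are actually on an Ampere (SM 8.x) device before opting into this path is a simple capability check; this is just a defensive sketch, not something torchao requires:

```python
import torch

# the CUTLASS W4A8 kernel discussed here targets Ampere (SM 8.x) GPUs
major, minor = torch.cuda.get_device_capability()
use_cutlass_w4a8 = (major == 8)
print(f"Compute capability {major}.{minor}; CUTLASS W4A8 path enabled: {use_cutlass_w4a8}")
```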
Hi, my inference code hits an exception here: https://github.com/pytorch/ao/blob/main/torchao/dtypes/affine_quantized_tensor_ops.py#L228
when I use Int8DynamicActivationInt4Weight. The inference is slower than bf16 inference, as it falls back and dequantizes back to bf16.
I'm on torch 2.5.0+cu124.
It hits the exception even when I disable torch.compile().
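For reference, a minimal repro along these lines (the toy module, shapes, and compile call are illustrative; the config name matches the one in the title):

```python
import torch
from torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()
quantize_(model, Int8DynamicActivationInt4WeightConfig())

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
# with the default layout this ends up on the dequantize fallback path
# instead of a fused int8/int4 kernel, so it is slower than plain bf16
out = torch.compile(model)(x)
```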