Add npu support for LLM.int8 forward #1534
base: multi-backend-refactor
Conversation
colidx_tmp = torch.unique(outliers_col_idx)
colidx = colidx_tmp[colidx_tmp != -1]
As an optimization, this can probably be avoided when `threshold == 0.0` and moved into the condition below.
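For illustration, a minimal sketch of that guard, assuming `threshold` is in scope and `-1` is the sentinel for unused index slots (both assumptions, not verified against the PR's code):

```python
import torch

def _outlier_cols(outliers_col_idx: torch.Tensor, threshold: float) -> torch.Tensor:
    # Sketch only: when threshold == 0.0 no outlier columns are tracked,
    # so the unique/filter work can be skipped entirely.
    if threshold == 0.0:
        return torch.empty(0, dtype=torch.long, device=outliers_col_idx.device)
    colidx_tmp = torch.unique(outliers_col_idx)
    return colidx_tmp[colidx_tmp != -1]  # drop the -1 sentinel entries
```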
Thanks! I'll provide a few comments. I also want to mention that we're quite close to moving to torch.library and custom ops for device dispatching, and we'll provide more info soon on porting over to that. It should be a fairly simple process!
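For reference, a rough sketch of what device dispatch with torch.library custom ops (PyTorch >= 2.4) could look like; the op name, signature, and kernel bodies here are illustrative assumptions, not the planned bitsandbytes interface:

```python
import torch

@torch.library.custom_op("bitsandbytes::int8_linear_matmul", mutates_args=(), device_types="cpu")
def int8_linear_matmul(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Reference CPU implementation: int8 matmul with int32 accumulation.
    return torch.matmul(A.to(torch.int32), B.to(torch.int32).t())

@int8_linear_matmul.register_kernel("npu")
def _(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Device-specific kernel registered per backend; on Ascend this would
    # call into torch_npu primitives instead.
    raise NotImplementedError("NPU kernel goes here")
```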
@@ -69,7 +181,7 @@ def int8_linear_matmul(
     out: Optional[torch.Tensor] = None,
     dtype=torch.int32,
 ) -> torch.Tensor:
-    raise NotImplementedError
+    return Int8AB(A, B)
Interesting and clever! While this does break the expected API as this isn't returning a Tensor (or performing any operations really), I can completely understand why it is done this way. I think this will be OK for right now and we'll make the interface better in this regard later on.
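For readers following along, the pattern under discussion is roughly the following (a hedged illustration; the PR's actual Int8AB may carry more state): instead of computing the int8 product inside int8_linear_matmul, the quantized operands are wrapped and the matmul is deferred, so it can later be fused with dequantization in a single NPU call.

```python
from dataclasses import dataclass
import torch

@dataclass
class Int8AB:
    # Lightweight handle returned instead of an immediate matmul result; a
    # downstream dequant step unpacks it and runs a fused matmul+dequant.
    A: torch.Tensor  # int8 activations
    B: torch.Tensor  # int8 weights
```

The payoff is that the NPU backend can hand both operands to one fused kernel (e.g. `npu_quant_matmul`) rather than materializing an int32 intermediate result.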
# `torch.Tensor.to(<int num>)` is not supported by `torch_npu` (see this [issue](https://github.com/Ascend/pytorch/issues/16)).
if isinstance(device, int):
    device = f"npu:{device}"
Does this require a bump in the minimum supported torch_npu version now?
What does this PR do?
- LLM.int8 inference (forward-only)
- `int8_vectorwise_dequant` (AscendC version WIP)
- `npu_quant_matmul` for NPU-optimized matmul+dequant (see the sketch below)
- NF4 memory fix
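A hedged sketch of how those pieces could fit together on an Ascend device; the quantization helper, the exact `torch_npu.npu_quant_matmul` argument names, and the weight layout below are assumptions for illustration, not code from this PR:

```python
import torch
import torch_npu  # Ascend adapter for PyTorch

def int8_vectorwise_dequant(A: torch.Tensor, stats: torch.Tensor) -> torch.Tensor:
    # Pure-PyTorch row-wise dequant (the AscendC kernel mentioned above would
    # replace this): values were quantized with per-row absmax scaled to int8.
    return A.to(torch.float32) * stats.view(-1, 1) / 127.0

def llm_int8_forward_sketch(x: torch.Tensor, w_int8: torch.Tensor, w_scale: torch.Tensor) -> torch.Tensor:
    # Per-token (row-wise) absmax quantization of fp16 activations to int8.
    absmax = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    x_scale = (absmax / 127.0).float()
    x_int8 = torch.round(x / x_scale).to(torch.int8)

    # Fused int8 matmul + dequant on the NPU. Weight is assumed stored as
    # (out_features, in_features) and transposed for the (m, k) @ (k, n) matmul.
    return torch_npu.npu_quant_matmul(
        x_int8,
        w_int8.t(),
        scale=w_scale,                    # per-output-channel weight scales
        pertoken_scale=x_scale.view(-1),  # per-token activation scales
        output_dtype=torch.float16,
    )
```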
Notes
Collaborators
@ji-huazhong @Ginray @MatrixPlayer
cc @Titus-von-Koeller @matthewdouglas