[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS #6036
Conversation
Nice!
BTW, are there any tools available that can automatically resolve these lint issues?
vllm/model_executor/layers/quantization/gptq_bitblas.py:28:1: E402 Module level import not at top of file
vllm/model_executor/layers/quantization/gptq_bitblas.py:28:8: F811 Redefinition of unused `bitblas` from line 21
vllm/model_executor/layers/quantization/gptq_bitblas.py:29:1: E402 Module level import not at top of file
vllm/model_executor/layers/quantization/gptq_bitblas.py:66:81: E501 Line too long (107 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:172:81: E501 Line too long (85 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:222:81: E501 Line too long (105 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:230:81: E501 Line too long (89 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:233:81: E501 Line too long (110 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:236:81: E501 Line too long (99 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:242:81: E501 Line too long (84 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:253:81: E501 Line too long (94 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:414:81: E501 Line too long (86 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:417:29: G004 Logging statement uses f-string
vllm/model_executor/layers/quantization/gptq_bitblas.py:420:17: G004 Logging statement uses f-string
vllm/model_executor/layers/quantization/gptq_bitblas.py:427:81: E501 Line too long (103 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:433:81: E501 Line too long (116 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:454:81: E501 Line too long (82 > 80)
@LeiWang1999 thanks for the WIP, very cool interface with bitblas as a package. Can you explain whether the GPTQ benchmarking results in vLLM were run with the base "gptq" kernels or through the "gptq_marlin" interface that takes advantage of the Marlin kernels? This is important for comparing against the current baseline we use for GPTQ models in vLLM.
Thanks! The benchmarks at that time used the exllamav2 kernels; we will look into a comparison with the Marlin kernel.
Hi all, I recently updated the support for the 1.58-bit model and the related BitBLAS inference kernel for vLLM.

We will soon benchmark against Marlin. It also looks like the docs build failed because of the bitblas dependency. Do you have any ideas on how to fix this? Should we add the bitblas requirement to the docs requirements, or is there an option to skip this dependency? @mgoin
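One option sometimes used to avoid a hard docs dependency (an assumption on my part, not something decided in this thread, and assuming the docs are built with Sphinx autodoc) is to mock the optional import in the docs configuration:

# docs conf.py (path assumed) -- hedged sketch: mock heavy/optional imports so
# the documentation build does not need the real package installed.
autodoc_mock_imports = [
    "bitblas",  # optional quantization backend; mocked for the docs build only
]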
I think this PR is ready for review. Here is a summary of this update: we now support BitBLAS as a quantization backend, and vLLM can serve pretrained models from Hugging Face (in GPTQ, BitNet, or BitBLAS format) with the BitBLAS inference kernel. We briefly compared performance against Marlin using the throughput benchmark scripts provided by vLLM on an A100:

python benchmark_throughput.py --backend vllm --num-prompts 1 --input-len 32 --output-len 512 --max-model-len 1024 --model "hxbgsyxh/llama-13b-4bit-g-1-bitblas" --quantization "bitblas"
python benchmark_throughput.py --backend vllm --num-prompts 1 --input-len 32 --output-len 512 --max-model-len 1024 --model "hxbgsyxh/llama-13b-4bit-g-1-marlin" --quantization "marlin"

The performance results are:
Some notes:
Moreover, this PR also adds support for the 1.58-bit BitNET model.
All correctness checks have been evaluated with the following:

from conftest import VllmRunner
import torch

# Test BitNET model with BitBLAS quantization
with VllmRunner(
        "hxbgsyxh/bitnet_b1_58-3B",
        dtype="half",
        quantization="bitnet_bitblas",
        enforce_eager=True,
        gpu_memory_utilization=0.5,
) as bitnet_model:
    bitbnet_outputs = bitnet_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    print("bitnet_bitblas:")
    print(bitbnet_outputs[0][0])
    print(bitbnet_outputs[0][1])

# Test the same BitNET model in BitBLAS format
with VllmRunner(
        "hxbgsyxh/bitnet_b1_58-3B_bitblas",
        dtype="half",
        quantization="bitblas",
        enforce_eager=True,
) as bitnet_model:
    torch.cuda.profiler.start()
    bitbnet_outputs = bitnet_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    torch.cuda.profiler.stop()
    print("bitblas:")
    print(bitbnet_outputs[0][0])
    print(bitbnet_outputs[0][1])

# Test GPTQ quantized model (baseline GPTQ kernel)
with VllmRunner(
        "hxbgsyxh/opt-125m-4bit-128g",
        dtype="half",
        quantization="gptq",
        enforce_eager=True,
) as gptq_model:
    torch.cuda.profiler.start()
    bitbnet_outputs = gptq_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    torch.cuda.profiler.stop()
    print("gptq:")
    print(bitbnet_outputs[0][0])
    print(bitbnet_outputs[0][1])

torch.compiler.reset()

# Test GPTQ quantized model with BitBLAS
with VllmRunner(
        "hxbgsyxh/opt-125m-4bit-128g-bitblas",
        dtype="half",
        quantization="bitblas",
        enforce_eager=True,
) as bitblas_model:
    torch.cuda.profiler.start()
    bitbnet_outputs = bitblas_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    torch.cuda.profiler.stop()
    print("bitblas:")
    print(bitbnet_outputs[0][0])
    print(bitbnet_outputs[0][1])
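For reference, the same checkpoints can also be exercised through the plain offline API rather than the VllmRunner test helper; a minimal sketch (not part of this PR's tests, model name reused from the benchmarks above):

from vllm import LLM, SamplingParams

# Hedged sketch: serving a BitBLAS-format checkpoint via the offline LLM API.
llm = LLM(
    model="hxbgsyxh/llama-13b-4bit-g-1-bitblas",
    quantization="bitblas",
    dtype="half",
    enforce_eager=True,
)
outputs = llm.generate(
    ["Hi, tell me about Microsoft?"],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)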
Any questions are welcome, and please review when you have a moment :) @mgoin @robertgshaw2-neuralmagic
Thanks for all the work @LeiWang1999! I have a few high-level thoughts first on how to make landing this more straightforward:
Thanks for your suggestions, @mgoin!
with VllmRunner(
        "hxbgsyxh/llama-13b-4bit-g-1",  # model in GPTQ format
        dtype="half",
        quantization="bitblas",
        enforce_eager=True,
) as bitblas_model:
    torch.cuda.profiler.start()
    bitbnet_outputs = bitblas_model.generate_greedy(
        ["Hi, tell me about microsoft?"], max_tokens=128
    )
    torch.cuda.profiler.stop()
    print("bitblas:")
    print(bitbnet_outputs[0][0])
    print(bitbnet_outputs[0][1])
Thanks for splitting it up! I left a first round of clear nits/issues and will do a more in-depth pass later. There seem to be a lot of stray formatting changes, for some reason.
Hi @mgoin, are there any further updates or actions we should take?
Hi @LeiWang1999, I'm very sorry for the delay; I lost track of this PR and didn't catch your ping.
There has been an ongoing refactor for quantization methods to use a new set of vLLMParameters (see gptq_marlin PR #7281) to simplify weight loading, but we could delay this for bitblas to make it easier to land this initial PR.
Also as mentioned in #7725 (comment), there will be a few merge conflicts with main.
If/when you have bandwidth to finish this out, I promise to get this over the line asap. Please let me know!
if layer.bitblas_state == GPTQBitBLASState.REPACK:
    layer.bitblas_state = GPTQBitBLASState.READY

    # Newly generated tensors need to replace existing tensors that are
    # already registered as parameters by vLLM (and won't be freed)
    def replace_tensor(name, new_t):
        # It is important to use copy_() here since it ensures
        # the same buffer is reused
        getattr(layer, name).copy_(
            new_t.view(getattr(layer, name).dtype).view(
                getattr(layer, name).shape))
        del new_t

    # Repack weights
    bitblas_qweight, bitblas_scales, bitblas_qzeros = (
        self.repack_bitblas_from_gptq(
            layer.qweight,
            layer.scales,
            layer.qzeros,
        ))
    replace_tensor("qweight", bitblas_qweight)
    replace_tensor("scales", bitblas_scales)
    replace_tensor("qzeros", bitblas_qzeros)
It would be best to move this into a process_weights_after_loading function, which we have specifically for this purpose; see the example in gptq_marlin.py:

def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
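For concreteness, a hedged sketch of what that move could look like for the BitBLAS path, reusing the repack logic quoted above (GPTQBitBLASState and repack_bitblas_from_gptq come from this PR; the exact surrounding class and call sites are assumed):

import torch

# Hedged sketch: the repack logic relocated into process_weights_after_loading,
# mirroring the gptq_marlin.py pattern. Assumes this sits inside the GPTQBitBLAS
# quantization method class introduced by this PR.
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    if layer.bitblas_state != GPTQBitBLASState.REPACK:
        return
    layer.bitblas_state = GPTQBitBLASState.READY

    def replace_tensor(name: str, new_t: torch.Tensor) -> None:
        # copy_() reuses the buffer already registered as a vLLM parameter,
        # so the old storage is not kept alive alongside the new one.
        getattr(layer, name).copy_(
            new_t.view(getattr(layer, name).dtype).view(
                getattr(layer, name).shape))
        del new_t

    # Repack the GPTQ tensors into the BitBLAS layout once, at load time.
    bitblas_qweight, bitblas_scales, bitblas_qzeros = (
        self.repack_bitblas_from_gptq(layer.qweight, layer.scales,
                                      layer.qzeros))
    replace_tensor("qweight", bitblas_qweight)
    replace_tensor("scales", bitblas_scales)
    replace_tensor("qzeros", bitblas_qzeros)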
Thanks, I'll take a look. I'm currently working on the stream-k template in bitblas :)
This pull request has merge conflicts that must be resolved before it can be merged.
@mgoin, apologies for the delay in updates. Over the last two months, we've made several significant improvements:
Given the recent public interest in the bitnet.cpp project (which runs BitNet on CPU), I think this is a great opportunity to advance the GPU-side integration of vLLM with BitNet (issue #7725). :) Would you mind taking a look at this pull request?
@LeiWang1999 thanks for the ping and updates, excited to review!
This pull request has merge conflicts that must be resolved before it can be merged.
I think all the changes in this file should be reverted
This file also seems unrelated?
It would be nice if we could have a general tile size to reuse rather than essentially duplicating the arg, although I understand why the _adjust_shard_indexes_for_X impls need to be different. @dsikka could you take a look at the tile size changes here?
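To illustrate the suggestion, a hedged sketch of a shared helper; the function and argument names here are illustrative, not the actual vLLM helpers:

# Illustrative sketch only: one shard-index adjustment that takes the tile
# size as data instead of baking in a per-backend (Marlin vs. BitBLAS) value.
def _adjust_shard_indexes_for_tiled_kernel(shard_size: int, shard_offset: int,
                                           tile_size: int) -> tuple[int, int]:
    # Tiled kernels lay weights out in tile_size x tile_size blocks, so the
    # shard boundaries scale by the same factor regardless of the backend.
    return shard_size * tile_size, shard_offset * tile_size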
Hi all, this PR introduces support for the Microsoft Runtime Kernel Library to enhance our low precision computation capabilities.
Brief Introduction to BitBLAS
BitBLAS is a library to support mixed-precision BLAS operations on GPUs, for example, the $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication where $C_{cdtype}[M, N] = A_{adtype}[M, K] \times W_{wdtype}[N, K]$. BitBLAS aims to support efficient mixed-precision DNN model deployment, especially the $W_{wdtype}A_{adtype}$ quantization in large language models (LLMs), for example, the $W_{UINT4}A_{FP16}$ in GPTQ, the $W_{INT2}A_{FP16}$ in BitDistiller, and the $W_{INT2}A_{INT8}$ in BitNet-b1.58.
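As a concrete illustration, a hedged sketch of a $W_{INT4}A_{FP16}$ matmul through the BitBLAS Python API, based on the BitBLAS documentation; exact field names and defaults may differ across versions:

import bitblas
import torch

# Hedged sketch of a W_INT4 x A_FP16 mixed-precision matmul via BitBLAS.
config = bitblas.MatmulConfig(
    M=1,                  # rows of A (e.g. batch * sequence length)
    N=1024,               # output features
    K=1024,               # reduction dimension
    A_dtype="float16",    # activation dtype
    W_dtype="int4",       # weight dtype
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",          # A row-major, W transposed: C[M, N] = A[M, K] x W[N, K]^T
)
matmul = bitblas.Matmul(config=config)

activation = torch.rand((1, 1024), dtype=torch.float16, device="cuda")
weight = torch.randint(-7, 7, (1024, 1024), dtype=torch.int8, device="cuda")
packed_weight = matmul.transform_weight(weight)  # pack INT4 into the kernel layout
output = matmul(activation, packed_weight)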
PR Overview
This PR integrates BitBLAS into vLLM by adding examples of its usage. We provide two forms:
Below are the benchmarking results that we evaluated several months ago:
TODO ITEMS
Any feedback and suggestions to improve this integration are appreciated.