Commit 98175b2

Improve the docs for TransformersModel (vllm-project#14147)
Signed-off-by: Harry Mellor <[email protected]>
1 parent 4167252 commit 98175b2

1 file changed: +49 −19 lines changed

docs/source/models/supported_models.md

+49-19
@@ -14,8 +14,11 @@ Alongside each architecture, we include some popular models that use it.

By default, vLLM loads models from [HuggingFace (HF) Hub](https://huggingface.co/models).

-To determine whether a given model is supported, you can check the `config.json` file inside the HF repository.
-If the `"architectures"` field contains a model architecture listed below, then it should be supported in theory.
+To determine whether a given model is natively supported, you can check the `config.json` file inside the HF repository.
+If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
+
+Models do not _need_ to be natively supported to be used in vLLM.
+The <project:#transformers-fallback> enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).

:::{tip}
The easiest way to check if your model is really supported at runtime is to run the program below:
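To complement the `config.json` check described above, here is a minimal sketch that inspects the `architectures` field without downloading any weights. It assumes `transformers` is installed and the repository is public; the model ID is only an illustrative placeholder.

```python
from transformers import AutoConfig

# Fetch only the model's config.json and print its "architectures" field.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # illustrative model ID
print(config.architectures)  # e.g. ['Qwen2ForCausalLM']
```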
@@ -40,50 +43,59 @@ If vLLM successfully returns text (for generative models) or hidden states (for
Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.

+(transformers-fallback)=
+
### Transformers fallback

-`vllm` can fallback to models that are available in `transformers`. This does not work for all models for now, but most decoder language models are supported, and vision language model support is planned!
+vLLM can fall back to model implementations that are available in Transformers. This does not yet work for all models, but most decoder language models are supported, and vision language model support is planned!

-To check if the backend is `transformers`, you can simply do this:
+To check if the backend is Transformers, you can simply do this:

```python
from vllm import LLM
llm = LLM(model=..., task="generate") # Name or path of your model
-llm.apply_model(lambda model: print(model.__class__))
+llm.apply_model(lambda model: print(type(model)))
```

-If it is `TransformersModel` then it means it's based on `transformers`!
+If it is `TransformersModel`, then it is based on Transformers!

-#### Supported features
+:::{note}
+vLLM may not fully optimise the Transformers implementation, so you may see degraded performance when comparing a native model to a Transformers model in vLLM.
+:::

-##### Quantization
+#### Supported features

-Transformers fallback has supported most of available quantization in vLLM (except GGUF). See [Quantization page](#quantization-index) for more information about supported quantization in vllm.
+The Transformers fallback explicitly supports the following features:

-##### LoRA
+- <project:#quantization-index> (except GGUF)
+- <project:#lora-adapter>
+- <project:#distributed-serving> (pipeline parallel coming soon <gh-pr:12832>!)

-Transformers fallback has supported LoRA. The usage way is identical to how LoRA works with models supported by vLLM. If you encounter any issues, please open an issue.
+#### Remote code

-##### Remote code
+Earlier we mentioned that the Transformers fallback enables you to run remote-code models directly in vLLM.
+If you are interested in this feature, this section is for you!

-This fallback also means that any model on the hub that can be used in `transformers` with `trust_remote_code=True` that correctly implements attention can be used in production!
+Simply set `trust_remote_code=True` and vLLM will run any model on the Model Hub that is compatible with Transformers.
+Provided that the model writer implements their model in a compatible way, this means that you can run new models before they are officially supported in Transformers or vLLM!

```python
from vllm import LLM
llm = LLM(model=..., task="generate", trust_remote_code=True) # Name or path of your model
llm.apply_model(lambda model: print(model.__class__))
```

-A model just needs the following two things:
+To make your model compatible with the Transformers fallback, it needs:
+
+```{code-block} python
+:caption: modeling_my_model.py

-```python
from transformers import PreTrainedModel
from torch import nn

class MyAttention(nn.Module):

    def forward(self, hidden_states, **kwargs): # <- kwargs are required
-
        ...
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        attn_output, attn_weights = attention_interface(
@@ -102,8 +114,26 @@ class MyModel(PreTrainedModel):
Here is what happens in the background:

1. The config is loaded
-2. `MyModel` python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
-3. The `TransformersModel` backend is used. See `/model_executors/models/transformers`, which leverage `self.config._attn_implementation = "vllm"`, thus the need to use `ALL_ATTENTION_FUNCTION`.
+2. The `MyModel` Python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
+3. The `TransformersModel` backend is used. See <gh-file:vllm/model_executor/models/transformers.py>, which leverages `self.config._attn_implementation = "vllm"`, thus the need to use `ALL_ATTENTION_FUNCTIONS`.
+
+To make your model compatible with tensor parallel, it needs:
+
+```{code-block} python
+:caption: configuration_my_model.py
+
+from transformers import PretrainedConfig
+
+class MyConfig(PretrainedConfig):
+    base_model_tp_plan = {
+        "layers.*.self_attn.q_proj": "colwise",
+        ...
+    }
+```
+
+:::{tip}
+`base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
+:::

That's it!
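As a usage illustration of the fallback features described above, here is a minimal sketch of serving a remote-code model with tensor parallelism through the `LLM` API. The model name and parallel degree are placeholders, not tied to any specific repository; the same constructor also accepts the usual quantization and LoRA options from the feature list.

```python
from vllm import LLM, SamplingParams

# "my-org/my-remote-code-model" is a hypothetical repository that ships its own
# Transformers-compatible modeling code; replace it with a real model name or path.
llm = LLM(
    model="my-org/my-remote-code-model",
    task="generate",
    trust_remote_code=True,   # allow modeling code from the Hub to run
    tensor_parallel_size=2,   # sharded following the config's base_model_tp_plan
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

You can combine this with the `apply_model` check shown earlier to confirm that the `TransformersModel` backend is actually in use.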

@@ -893,7 +923,7 @@ Currently the PaliGemma model series is implemented without PrefixLM attention m
:::

:::{note}
-To use Qwen2.5-VL series models, you have to install Huggingface `transformers` library from source via `pip install git+https://github.com/huggingface/transformers`.
+To use Qwen2.5-VL series models, you have to install the Hugging Face Transformers library from source via `pip install git+https://github.com/huggingface/transformers`.
:::

### Pooling Models
