
Commit aaacf17

JenZhao, ywang96, and DarkLight1337 authored
[Doc] V1 user guide (vllm-project#13991)

Signed-off-by: Jennifer Zhao <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Jennifer Zhao <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

1 parent 4c7629c commit aaacf17

File tree

2 files changed: +161 -0 lines changed
docs/source/getting_started/v1_user_guide.md (+159)
@@ -0,0 +1,159 @@
# vLLM V1 User Guide

V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).

## Why vLLM V1?

vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design.

Building on V0's success, vLLM V1 retains the stable and proven components from V0 (such as the models, GPU kernels, and utilities). At the same time, it significantly re-architects the core systems, covering the scheduler, KV cache manager, worker, sampler, and API server, to provide a cohesive, maintainable framework that better accommodates continued growth and innovation.

Specifically, V1 aims to:

- Provide a **simple, modular, and easy-to-hack codebase**.
- Ensure **high performance** with near-zero CPU overhead.
- **Combine key optimizations** into a unified architecture.
- Require **zero configs** by enabling features/optimizations by default.

We see significant performance improvements from upgrading to the V1 core engine, particularly for long-context scenarios. Please see the performance benchmark (to be added).

For more details, check out the vLLM V1 blog post [vLLM V1: A Major Upgrade to vLLM's Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).

This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to make V1 the default engine, so this guide will be updated continually as more features become supported on vLLM V1.

### Support Overview

#### Hardware

| Hardware | Status |
|----------|------------------------------------------|
| **NVIDIA** | <nobr>🚀 Natively Supported</nobr> |
| **AMD** | <nobr>🚧 WIP</nobr> |
| **TPU** | <nobr>🚧 WIP</nobr> |

#### Feature / Model

| Feature / Model | Status |
|-----------------|-----------------------------------------------------------------------------------|
| **Prefix Caching** | <nobr>🚀 Optimized</nobr> |
| **Chunked Prefill** | <nobr>🚀 Optimized</nobr> |
| **Logprobs Calculation** | <nobr>🟢 Functional</nobr> |
| **LoRA** | <nobr>🟢 Functional ([PR #13096](https://github.com/vllm-project/vllm/pull/13096))</nobr>|
| **Multimodal Models** | <nobr>🟢 Functional</nobr> |
| **Spec Decode** | <nobr>🚧 WIP ([PR #13933](https://github.com/vllm-project/vllm/pull/13933))</nobr>|
| **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](https://github.com/vllm-project/vllm/issues/13414))</nobr>|
| **FP8 KV Cache** | <nobr>🟡 Planned</nobr> |
| **Structured Output Alternative Backends** | <nobr>🟡 Planned</nobr> |
| **Embedding Models** | <nobr>🟡 Planned ([RFC #12249](https://github.com/vllm-project/vllm/issues/12249))</nobr> |
| **Mamba Models** | <nobr>🟡 Planned</nobr> |
| **Encoder-Decoder Models** | <nobr>🟡 Planned</nobr> |
| **Request-level Structured Output Backend** | <nobr>🔴 Deprecated</nobr> |
| **best_of** | <nobr>🔴 Deprecated ([RFC #13361](https://github.com/vllm-project/vllm/issues/13361))</nobr>|
| **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360))</nobr> |
| **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Deprecated</nobr> |

- **🚀 Optimized**: Nearly fully optimized, with no further work currently planned.
- **🟢 Functional**: Fully operational, with ongoing optimizations.
- **🚧 WIP**: Under active development.
- **🟡 Planned**: Scheduled for future implementation (some may have open PRs/RFCs).
- **🔴 Deprecated**: Not planned for V1 unless there is strong demand.

**Note**: vLLM V1's unified scheduler treats both prompt and output tokens the same way, using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically allocate a fixed token budget per request. This enables features like chunked prefills, prefix caching, and speculative decoding without a strict separation between prefill and decode phases.
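
The following is a toy illustration of that idea, not vLLM's actual scheduler code; the function and request names are hypothetical:

```python
# Toy sketch: each engine step, a fixed token budget is split across requests
# as a {request_id: num_tokens} mapping. Prompt (prefill) tokens and generated
# (decode) tokens go through the same path, so a long prompt is simply chunked
# by the budget while a decoding request asks for a single token.
def schedule_step(remaining_tokens: dict[str, int], token_budget: int) -> dict[str, int]:
    """Return {request_id: num_tokens} to process in this step."""
    allocation: dict[str, int] = {}
    for request_id, remaining in remaining_tokens.items():
        if token_budget == 0:
            break
        num_tokens = min(remaining, token_budget)
        allocation[request_id] = num_tokens
        token_budget -= num_tokens
    return allocation

# One request decoding (1 token), one still prefilling a long prompt.
print(schedule_step({"req-1": 1, "req-0": 7000}, token_budget=2048))
# -> {'req-1': 1, 'req-0': 2047}
```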

### Semantic Changes and Deprecated Features

#### Logprobs

vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic differences compared to V0:

**Logprobs Calculation**

Logprobs in V1 are now returned immediately once computed from the model's raw output (i.e., before applying any logits post-processing such as temperature scaling or penalty adjustments). As a result, the returned logprobs do not reflect the final adjusted probabilities used during sampling.

Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.
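
A minimal offline example of requesting logprobs (the model name is only illustrative); under V1 the returned values reflect the raw, pre-adjustment distribution described above:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model
# Ask for the top-5 logprobs of each generated token.
params = SamplingParams(temperature=0.8, max_tokens=16, logprobs=5)

completion = llm.generate(["The capital of France is"], params)[0].outputs[0]
print(completion.text)
# completion.logprobs is a list (one entry per generated token) mapping
# candidate token ids to logprob objects; in V1 these values come from the
# raw model output, before temperature scaling or penalties are applied.
print(completion.logprobs[0])
```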

**Prompt Logprobs with Prefix Caching**

Currently, prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](https://github.com/vllm-project/vllm/issues/13414).
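
For offline inference, the equivalent of the `--no-enable-prefix-caching` server flag is the `enable_prefix_caching` constructor argument. A short sketch (model name illustrative):

```python
from vllm import LLM, SamplingParams

# Prefix caching must be disabled for prompt logprobs to be returned today;
# this mirrors passing --no-enable-prefix-caching to the server.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=False)
params = SamplingParams(max_tokens=8, prompt_logprobs=1)

output = llm.generate(["The capital of France is"], params)[0]
print(output.prompt_logprobs)  # one logprob entry per prompt token (the first is None)
```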

#### Deprecated Features

As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.

**Sampling features**

- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom processing functions to adjust logits on a per-request basis. In vLLM V1, this feature has been deprecated. Instead, the design is moving toward supporting **global logits processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](https://github.com/vllm-project/vllm/pull/13360), and the sketch below for the V0-style interface being deprecated.
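
For reference, this is roughly what the deprecated V0-style per-request interface looks like (the banned token id is arbitrary); the V1 engine does not accept per-request `logits_processors`:

```python
import torch
from vllm import SamplingParams

BANNED_TOKEN_ID = 42  # arbitrary token id, purely for illustration

def ban_token(generated_token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    """V0-style per-request logits processor: mask out one token at every step."""
    logits[BANNED_TOKEN_ID] = float("-inf")
    return logits

# V0 only: the processor is attached to a single request's sampling params.
params = SamplingParams(max_tokens=16, logits_processors=[ban_token])
```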

**KV Cache features**

- **GPU <> CPU KV Cache Swapping**: With the new simplified core architecture, vLLM V1 no longer requires KV cache swapping to handle request preemptions.

**Structured Output features**

- **Request-level Structured Output Backend**: Deprecated; support for alternative backends (outlines, guidance) with fallbacks is work in progress.

### Feature & Model Support in Progress

Although we have re-implemented and partially optimized many features and models from V0 in vLLM V1, optimization work is still ongoing for some, and others remain unsupported.

#### Features to Be Optimized

These features are already supported in vLLM V1, but their optimization is still in progress.

- **LoRA**: LoRA works functionally on vLLM V1, but its performance is inferior to that of V0. The team is actively working on improving it (e.g., see [PR #13096](https://github.com/vllm-project/vllm/pull/13096)); a usage sketch follows this list.

- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). We will prioritize support for EAGLE and MTP over draft-model-based spec decode.
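
A minimal LoRA usage sketch with the offline API (the base model and adapter path are placeholders); the call pattern is the same under V1, only the performance currently differs:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora turns on LoRA support; the base model here is only an example.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
params = SamplingParams(max_tokens=64)

outputs = llm.generate(
    ["Write a SQL query that counts users per country."],
    params,
    # LoRARequest(name, id, path); the adapter path is a placeholder.
    lora_request=LoRARequest("sql-lora", 1, "/path/to/sql-lora-adapter"),
)
print(outputs[0].outputs[0].text)
```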

#### Features to Be Supported

- **FP8 KV Cache**: While vLLM V1 introduces new FP8 kernels for model weight quantization, support for an FP8 key-value cache is not yet available. Users must continue using FP16 (or other supported precisions) for the KV cache.

- **Structured Output Alternative Backends**: Support for alternative structured output backends (outlines, guidance) is planned. V1 currently supports only the `xgrammar:no_fallback` mode, meaning that it will error out if the output schema is unsupported by xgrammar. Details about structured outputs can be found [here](https://docs.vllm.ai/en/latest/features/structured_outputs.html); a short offline example follows this list.
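
A sketch of structured output with the offline API, assuming the `GuidedDecodingParams` interface described in the structured outputs docs (model name illustrative); under V1 this currently goes through xgrammar, and schemas xgrammar cannot handle raise an error instead of falling back to another backend:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

json_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # example model
params = SamplingParams(
    max_tokens=128,
    # Constrain generation to the JSON schema above; V1 uses xgrammar here.
    guided_decoding=GuidedDecodingParams(json=json_schema),
)
print(llm.generate(["Describe a fictional person as JSON."], params)[0].outputs[0].text)
```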

#### Models to Be Supported

vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol, and the majority of them fall into the following categories. V1 support for these models will be added eventually.

**Embedding Models**
Instead of having a separate model runner, a hidden states processor ([RFC #12249](https://github.com/vllm-project/vllm/issues/12249)), which is based on the global logits processor ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360)), has been proposed to enable simultaneous generation and embedding using the same engine instance in V1. It is still in the planning stage.

**Mamba Models**
Models using selective state-space mechanisms (instead of standard transformer attention) are not yet supported (e.g., `MambaForCausalLM`, `JambaForCausalLM`).

**Encoder-Decoder Models**
vLLM V1 is currently optimized for decoder-only transformers. Models requiring cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`).

For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).

## FAQ

TODO

docs/source/index.md (+2)

@@ -67,6 +67,8 @@ getting_started/quickstart
 getting_started/examples/examples_index
 getting_started/troubleshooting
 getting_started/faq
+getting_started/v1_user_guide
+
 :::

 % What does vLLM support?
