[Bugfix] EAGLE output norm bug #14464
Conversation
Thanks for this PR!
- The fix looks good to me.
- Adding a request-level acceptance metric also sounds good to me, but I do have some questions about style/definition; see comments.
- Could you open an issue to track the two bugs you mentioned in this PR's description?
Thanks!
@@ -0,0 +1,90 @@
# SPDX-License-Identifier: Apache-2.0
Great! Could you also add a sentence to the doc (https://github.com/vllm-project/vllm/blob/main/docs/source/features/spec_decode.md) referring to this example?
vllm/sequence.py
@@ -121,6 +121,7 @@ class RequestMetrics:
     scheduler_time: Optional[float] = None
     model_forward_time: Optional[float] = None
     model_execute_time: Optional[float] = None
+    node_acceptance_counts: Optional[list[int]] = None
Could you add a comment to this field?
@@ -830,6 +830,10 @@ def _create_sequence_group_with_sampling(
            self.generation_config_fields, seq.eos_token_id)

        # Create the sequence group.
        draft_size = 1
A better name? To be precise, it's not a draft size; it's the maximum number of tokens a step can generate?
I named it draft_size because in the future there might be multi-draft & tree-attention support. In those cases, draft_size will not be the max number of tokens a step can generate, but rather the number of nodes in the draft tree. If you believe there is a better name, I am more than happy to change it!
I see! Then could you add a comment here and explain why it's different from num_spec_token?
            ))
            step_index=None if
            accepted_token_ids_by_step[step_index][sequence_index]
            == -1 else step_index))
Confused: why do you need to make step_index None if the token is not accepted?
So that we know this token is not accepted? Alternatively, if we made it 0, then the root node (generated by the target model) would get a non-existent accepted token.
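For illustration, here is a minimal self-contained sketch of the mapping being discussed; the variable names mirror the diff above, but the surrounding data and output structure are hypothetical:

```python
# Hypothetical sketch: accepted_token_ids_by_step[step][seq] holds -1 when the
# token proposed at that step was rejected for that sequence.
accepted_token_ids_by_step = [
    [1234, 5678],  # step 0: tokens sampled by the target model (always kept)
    [4321, -1],    # step 1: the second sequence rejected its speculated token
]

outputs = []
for step_index, step_tokens in enumerate(accepted_token_ids_by_step):
    for sequence_index, token_id in enumerate(step_tokens):
        outputs.append({
            "token_id": token_id,
            # None marks a rejected token so it is never counted as accepted;
            # defaulting to 0 would wrongly credit the root (target-model) node.
            "step_index": None if token_id == -1 else step_index,
        })
```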
        topk_token_ids: List[Optional[int]],
        topk_logprobs: List[Optional[float]],
        prompt_logprobs: Optional[PromptLogprobs] = None,
        step_index: Optional[int] = 0) -> CompletionSequenceGroupOutput:
Same question as above: why is it an optional field?
Ah, I made it optional to minimize changes; this way I don't have to modify the single-step worker code to accommodate this additional field. What's a better way to do this?
        for output in outputs:
            if output.step_index is not None:
                sequence_group.metrics.node_acceptance_counts[
                    output.step_index] += 1
Why do you need to add based on step_index? Can you just add the number of generated tokens this step?
Also, can step_index only be 0, 1, 2, ..., num_spec_tokens? Do you want to group the number of accepted tokens based on the position at which they were proposed?
Yes, step_index can be in [0, num_spec_tokens]. This way, we can analyze the additional accepted tokens for each additional speculation step. For example, for a finished request, we might observe its node_acceptance_counts to be [100, 80, 45, 20]. With this granularity, we can tell we had 100 forward passes, and for the third speculated token, its acceptance rate is only 20%, which may not be worth the verification overhead.

I'm a bit confused regarding the grouping idea. If I understood it correctly, we only have single-draft spec decoding at the moment, so step_index is essentially the position (n-th speculated token)?
I see, got it. Yeah, then please also add the [100, 80, 45, 20] example in the comments when you add comments to node_acceptance_counts.
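For reference, a small sketch of how such a counts vector can be read, using the hypothetical [100, 80, 45, 20] example from this thread:

```python
# Index 0 counts target-model forward passes; index k counts how many times
# the k-th speculated token was accepted.
node_acceptance_counts = [100, 80, 45, 20]

forward_passes = node_acceptance_counts[0]
per_step_acceptance = [c / forward_passes for c in node_acceptance_counts[1:]]
print(per_step_acceptance)  # [0.8, 0.45, 0.2]

# Average number of tokens produced per forward pass (acceptance length).
print(sum(node_acceptance_counts) / forward_passes)  # 2.45
```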
@@ -38,7 +38,7 @@ def forward(self, x, residual):
         if residual is None:
             return x
         else:
-            return x, residual
+            return x + residual, None
Thanks for the catch!
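For readers following along, here is a minimal sketch of the corrected pass-through norm; the class name and signature are modeled on the EAGLE code touched by this hunk, but the details should be read as illustrative rather than the exact upstream implementation:

```python
from typing import Optional

import torch
from torch import nn


class DummyOutputNorm(nn.Module):
    """Pass-through replacement for the final norm in the EAGLE draft model.

    With fused residual connections, each decoder layer returns hidden states
    and the residual separately, so this stand-in must still fold the residual
    back in rather than returning the hidden states alone.
    """

    def forward(self, x: torch.Tensor, residual: Optional[torch.Tensor]):
        if residual is None:
            return x
        # The buggy version returned (x, residual), so the residual was never
        # added back before the hidden states were consumed downstream.
        return x + residual, None
```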
Hi Lily, thanks for reviewing my PR! I will address your comments as I get more clarification on some of your suggestions and concerns. The issues for those two bugs are now detailed and tracked in #14647 and #14649.
LGTM, thanks!
Please fix the doc build error.
Also need to merge from main to fix the multi-modal CI failure.
Hello @DarkLight1337, thanks for the note! The error is because the file linked in the doc is actually part of this PR. Wondering what's the best practice here? Thanks again for your time and help!
docs/source/features/spec_decode.md
@@ -162,7 +162,7 @@ A variety of speculative models of this type are available on HF hub:
 ## Speculating using EAGLE based draft models

 The following code configures vLLM to use speculative decoding where proposals are generated by
-an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](/examples/offline_inference/eagle.py).
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/eagle.py).
Suggested change:
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](<gh-file:examples/offline_inference/eagle.py>).
We can use the gh-file scheme (defined by myst_url_schemes in the config).
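For context, custom schemes like gh-file are declared through myst-parser's myst_url_schemes option in the docs configuration; the fragment below is an illustrative sketch of such an entry, not necessarily the exact one in vLLM's docs/source/conf.py:

```python
# Illustrative docs/source/conf.py fragment: expand gh-file: links to files on
# the repository's main branch.
myst_url_schemes = {
    "http": None,   # pass ordinary URLs through unchanged
    "https": None,
    "gh-file": {
        "url": "https://github.com/vllm-project/vllm/blob/main/{{path}}",
        "title": "{{path}}",
    },
}
```

With a scheme like this, the doc can simply link [here](gh-file:examples/offline_inference/eagle.py) and the build resolves it to the file on GitHub.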
@luyuzhe111
The current dummy output norm for EAGLE is incorrect. After the fix, the number of accepted tokens improved drastically from 1.4 to 1.8 with 2 speculated tokens. Essentially, the dummy RMS norm should return hidden + residual instead of hidden only.

Additionally, I enabled tracking the number of accepted tokens at the request level. Currently, I believe the only way to check acceptance length is by looking at the acceptance rate from the intermittent system logs. This makes debugging and reproduction challenging. Enabling request-level stats will be particularly helpful given that the vLLM EAGLE implementation has at least two more bugs that hurt EAGLE's acceptance length: first, the first token is not properly removed; second, the EAGLE model keeps using KV cache contaminated with its own hidden states, instead of the target model's hidden states, for accepted tokens. These factors in part explain the low performance reported in issue #9565. I'm happy to collaborate on solving these issues.
Finally, I added an example script for EAGLE where I show how the request-level acceptance length can be extracted. This also addresses issue #11126, where the community is a bit confused about how to use EAGLE.
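As a rough sketch of what that extraction can look like with the offline API (the model names, speculative-decoding arguments, and metrics access below are illustrative assumptions based on this PR's diff, not a verbatim copy of the example script):

```python
from vllm import LLM, SamplingParams

# Illustrative target/draft pair and settings; the real example may differ.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    speculative_model="yuhuili/EAGLE-llama2-chat-7B",
    num_speculative_tokens=2,
)

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=64))

for out in outputs:
    counts = out.metrics.node_acceptance_counts  # field added in this PR
    # counts[0]: target-model forward passes; counts[k]: acceptances of the
    # k-th speculated token. Acceptance length = tokens per forward pass.
    print(counts, sum(counts) / counts[0])
```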
Would appreciate your review @LiuXiaoxuanPKU!