
[Bugfix] EAGLE output norm bug #14464

Merged: 8 commits into vllm-project:main on Mar 15, 2025

Conversation

@luyuzhe111 (Contributor) commented Mar 7, 2025

The current dummy output norm for EAGLE is incorrect. After fixing it, the mean number of accepted tokens improved drastically from 1.4 to 1.8 with 2 speculated tokens. Essentially, the dummy RMS norm should return hidden + residual instead of hidden only, as sketched below.
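
For illustration, here is a minimal sketch of the corrected forward pass (the class name is just a placeholder for the dummy output norm used by the EAGLE draft model, not necessarily the exact vLLM class):

```python
from typing import Optional

import torch
from torch import nn


class DummyOutputNorm(nn.Module):
    """Placeholder norm at the end of the EAGLE draft model."""

    def forward(self, x: torch.Tensor, residual: Optional[torch.Tensor]):
        if residual is None:
            return x
        # Before the fix this returned (x, residual), so downstream code that
        # consumed only the hidden states silently dropped the residual.
        # Folding the residual in here restores the intended hidden state.
        return x + residual, None
```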

Additionally, I enabled tracking the number of accepted tokens at the request level. Currently, I believe the only way to check acceptance length is to look at the acceptance rate in the intermittent system logs, which makes debugging and reproduction challenging. Enabling request-level stats will be particularly helpful given that the vLLM EAGLE implementation has at least two more bugs that hurt EAGLE's acceptance length: first, the first token is not properly removed; second, the EAGLE model keeps using a KV cache contaminated with its own hidden states instead of the target model's hidden states for accepted tokens. These factors in part explain the low performance reported in issue #9565. I'm happy to collaborate on solving these issues.

Finally, I added an example script for EAGLE that shows how request-level acceptance length can be extracted; a rough sketch of the idea follows. This also addresses issue #11126, where the community is a bit confused about how to use EAGLE.
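
As a rough sketch of what the example demonstrates (the model names and constructor arguments below are illustrative assumptions, and the metrics field name may still change during review; see the example script for the canonical setup):

```python
from vllm import LLM, SamplingParams

# Illustrative EAGLE setup; model names and arguments are assumptions here.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="yuhuili/EAGLE-LLaMA3-Instruct-8B",
    num_speculative_tokens=2,
)

outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.0, max_tokens=256),
)

for output in outputs:
    # node_acceptance_counts is the per-request counter added in this PR:
    # index 0 counts target-model steps, index i counts how often the i-th
    # speculated token was accepted.
    counts = output.metrics.node_acceptance_counts
    # Mean number of tokens emitted per target forward pass.
    print(f"acceptance length: {sum(counts) / counts[0]:.2f}")
```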

Would appreciate your review @LiuXiaoxuanPKU!

github-actions bot commented Mar 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the documentation (Improvements or additions to documentation) and speculative-decoding labels Mar 7, 2025
Signed-off-by: Bryan Lu <[email protected]>
Signed-off-by: Bryan Lu <[email protected]>
@LiuXiaoxuanPKU LiuXiaoxuanPKU self-assigned this Mar 8, 2025
@LiuXiaoxuanPKU (Collaborator) left a comment:

Thanks for this PR!

  1. The fix looks good to me.
  2. Adding request level acceptance metric also sounds good to me. But I do have some questions about styles/definition, see comments.
  3. Could you open an issue to track the following two bugs you mentioned in this PR's description?

Thanks!

@@ -0,0 +1,90 @@
+# SPDX-License-Identifier: Apache-2.0
Collaborator:

Great! Could you also add a sentence in the doc (https://github.com/vllm-project/vllm/blob/main/docs/source/features/spec_decode.md) referring to this example?

vllm/sequence.py Outdated
@@ -121,6 +121,7 @@ class RequestMetrics:
     scheduler_time: Optional[float] = None
     model_forward_time: Optional[float] = None
     model_execute_time: Optional[float] = None
+    node_acceptance_counts: Optional[list[int]] = None
Collaborator:

Could you add a comment to this field?

@@ -830,6 +830,10 @@ def _create_sequence_group_with_sampling(
             self.generation_config_fields, seq.eos_token_id)
 
         # Create the sequence group.
+        draft_size = 1
Collaborator:

A better name? Precisely, it's not draft_size; it's the maximum number of tokens a step can generate?

Contributor (author):

I named it draft_size because in the future there might be multi-draft & tree-attention support. In those cases, draft_size will not be the max number of tokens a step can generate, but rather the number of nodes in the draft tree. If you have a better name in mind, I am more than happy to change it!

Collaborator:

I see! Then could you add a comment here explaining why it's different from num_spec_tokens?

-                ))
+                step_index=None if
+                accepted_token_ids_by_step[step_index][sequence_index]
+                == -1 else step_index))
Collaborator:

Confused, why do you need to make step_index None if token is not accepted?

Contributor (author):

So that we know this token was not accepted. Alternatively, if we set it to 0, the root node (generated by the target model) would be credited with a non-existent accepted token.

        topk_token_ids: List[Optional[int]],
        topk_logprobs: List[Optional[float]],
        prompt_logprobs: Optional[PromptLogprobs] = None,
        step_index: Optional[int] = 0) -> CompletionSequenceGroupOutput:
Collaborator:

Same question as above, why is it an optional field?

Contributor (author):

Ah, I made it optional to minimize changes: this way I don't have to modify the single-step worker code to accommodate the additional field. What would be a better way to do this?

        for output in outputs:
            if output.step_index is not None:
                sequence_group.metrics.node_acceptance_counts[
                    output.step_index] += 1
Collaborator:

Why do you need to add based on step_index, can you just add the number of generated tokens this step?

Collaborator:

Also, can step_index only be 0, 1, 2, ..., num_spec_tokens? Do you want to group the number of accepted tokens based on the position at which they were proposed?

Contributor (author):

Yes, step_index can be in [0, num_spec_tokens]. This way, we can analyze the additional accepted tokens contributed by each additional speculation step. For example, for a finished request, we might observe its node_acceptance_counts to be [100, 80, 45, 20]. With this granularity, we can tell that we ran 100 forward passes and that the third speculated token's acceptance rate is only 20%, which may not be worth the verification overhead.

I'm a bit confused regarding the grouping idea. If I understood it correctly, we only have single draft spec decoding atm, so step_index is essentially the position (n-th speculated token)?
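
To make this concrete, here's a tiny, purely illustrative snippet of how I read such a counter (the numbers are hypothetical, and this is not code from the PR):

```python
# Hypothetical per-request counter: index 0 = target-model forward passes,
# index i = number of times the i-th speculated token was accepted.
node_acceptance_counts = [100, 80, 45, 20]

num_steps = node_acceptance_counts[0]

# Acceptance rate of each speculated position.
per_position_rate = [c / num_steps for c in node_acceptance_counts[1:]]
print(per_position_rate)  # [0.8, 0.45, 0.2]

# Mean accepted length per step (target token plus accepted draft tokens).
print(sum(node_acceptance_counts) / num_steps)  # 2.45
```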

Collaborator:

I see, got it. Yeah then please also add [100, 80, 45, 20] example in the comments when you add comments to node_acceptance_counts.

@@ -38,7 +38,7 @@ def forward(self, x, residual):
         if residual is None:
             return x
         else:
-            return x, residual
+            return x + residual, None
Collaborator:

Thanks for the catch!

@luyuzhe111 (Contributor, author):

> Thanks for this PR!
>
>   1. The fix looks good to me.
>   2. Adding request level acceptance metric also sounds good to me. But I do have some questions about styles/definition, see comments.
>   3. Could you open an issue to track the following two bugs you mentioned in this PR's description?
>
> Thanks!

Hi Lily, thanks for reviewing my PR! I will address your comments accordingly as I get more clarification on some of your suggestions and concerns. The issues for those two bugs are now detailed and tracked in #14647 and #14649.

Signed-off-by: Bryan Lu <[email protected]>
@LiuXiaoxuanPKU (Collaborator) left a comment:

LGTM, thanks!

@LiuXiaoxuanPKU added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 14, 2025
@DarkLight1337 (Member):

Please fix the doc build error

@DarkLight1337 (Member):

Also need to merge from main to fix multi-modal CI failure

@luyuzhe111 (Contributor, author):

> Please fix the doc build error

Hello @DarkLight1337, thanks for the note! The error is due to the fact that the file linked in the doc is actually part of this PR. I'm wondering what the best practice is here? Thanks again for your time and help!

Signed-off-by: Bryan Lu <[email protected]>
@@ -162,7 +162,7 @@ A variety of speculative models of this type are available on HF hub:
 ## Speculating using EAGLE based draft models
 
 The following code configures vLLM to use speculative decoding where proposals are generated by
-an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](/examples/offline_inference/eagle.py).
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/eagle.py).
Member:

Suggested change:
-an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/eagle.py).
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](<gh-file:examples/offline_inference/eagle.py>).

We can use the gh-file scheme (defined by myst_url_schemes in the config)
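
For reference, such a scheme is declared roughly like this with MyST-Parser in a Sphinx conf.py (a sketch only; vLLM's actual configuration may differ):

```python
# docs/source/conf.py (sketch; the real vLLM entry may differ)
myst_url_schemes = {
    "http": None,
    "https": None,
    # Lets docs write <gh-file:path/to/file.py> and have it resolve
    # to a link into the repository on GitHub.
    "gh-file": {
        "url": "https://github.com/vllm-project/vllm/blob/main/{{path}}",
        "title": "{{path}}",
    },
}
```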

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) March 15, 2025 03:44
@DarkLight1337 DarkLight1337 disabled auto-merge March 15, 2025 03:44
@DarkLight1337 changed the title from "Fix EAGLE output norm bug" to "[Bugfix EAGLE output norm bug" Mar 15, 2025
@DarkLight1337 changed the title from "[Bugfix EAGLE output norm bug" to "[Bugfix] EAGLE output norm bug" Mar 15, 2025
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) March 15, 2025 03:44
@DarkLight1337 DarkLight1337 merged commit 9ed6ee9 into vllm-project:main Mar 15, 2025
39 of 40 checks passed
@llsj14 (Contributor) commented Mar 19, 2025

@luyuzhe111
Thank you for the fix! It resolved the performance issue we encountered in our experiment.

Labels: documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding