[Bugfix] EAGLE output norm bug #14464
Conversation
Thanks for this PR!
- The fix looks good to me.
- Adding a request-level acceptance metric also sounds good to me, but I do have some questions about style/definition; see comments.
- Could you open an issue to track the two bugs you mentioned in this PR's description?
Thanks!
@@ -0,0 +1,90 @@
# SPDX-License-Identifier: Apache-2.0
Great! Could you also add a sentence to the doc (https://github.com/vllm-project/vllm/blob/main/docs/source/features/spec_decode.md) referring to this example?
vllm/sequence.py
@@ -121,6 +121,7 @@ class RequestMetrics:
     scheduler_time: Optional[float] = None
     model_forward_time: Optional[float] = None
     model_execute_time: Optional[float] = None
+    node_acceptance_counts: Optional[list[int]] = None
Could you add a comment to this field?
@@ -830,6 +830,10 @@ def _create_sequence_group_with_sampling(
            self.generation_config_fields, seq.eos_token_id)

        # Create the sequence group.
        draft_size = 1
A better name? To be precise, it's not a draft size; it's the maximum number of tokens a step can generate?
I named it draft_size because in the future there might be multi-draft & tree-attention support. In those cases, draft_size will not be the max number of tokens a step can generate, but rather the number of nodes in the draft tree. If you believe there is a better name, I am more than happy to change it!
I see! Then could you add a comment here and explain why it's different from num_spec_token?
            ))
            step_index=None if
            accepted_token_ids_by_step[step_index][sequence_index]
            == -1 else step_index))
Confused: why do you need to make step_index None if the token is not accepted?
So that we know this token is not accepted? Alternatively, if we made it 0, then the root node (generated by the target model) would get a non-existent accepted token.
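For illustration, here is a minimal self-contained sketch of the mapping being discussed; the variable names mirror the diff above, but the surrounding data and output structure are hypothetical:

```python
# Hypothetical sketch: accepted_token_ids_by_step[step][seq] holds -1 when the
# token proposed at that step was rejected for that sequence.
accepted_token_ids_by_step = [
    [1234, 5678],  # step 0: tokens sampled by the target model (always kept)
    [4321, -1],    # step 1: the second sequence rejected its speculated token
]

outputs = []
for step_index, step_tokens in enumerate(accepted_token_ids_by_step):
    for sequence_index, token_id in enumerate(step_tokens):
        outputs.append({
            "token_id": token_id,
            # None marks a rejected token so it is never counted as accepted;
            # defaulting to 0 would wrongly credit the root (target-model) node.
            "step_index": None if token_id == -1 else step_index,
        })
```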
        topk_token_ids: List[Optional[int]],
        topk_logprobs: List[Optional[float]],
        prompt_logprobs: Optional[PromptLogprobs] = None,
        step_index: Optional[int] = 0) -> CompletionSequenceGroupOutput:
Same question as above: why is it an optional field?
Ah, I made it optional to minimize changes; this way I don't have to modify the single-step worker code to accommodate this additional field. What's a better way to do this?
        for output in outputs:
            if output.step_index is not None:
                sequence_group.metrics.node_acceptance_counts[
                    output.step_index] += 1
Why do you need to add based on step_index? Can you just add the number of generated tokens this step?
Also, can step_index only be 0, 1, 2, ..., num_spec_tokens? Do you want to group the number of accepted tokens based on the position at which they were proposed?
Yes, step_index can be in [0, num_spec_tokens]. This way, we can analyze the additional accepted tokens for each additional speculation step. For example, for a finished request, we might observe its node_acceptance_counts to be [100, 80, 45, 20]. With this granularity, we can tell we had 100 forward passes, and for the third speculated token, its acceptance rate is only 20%, which may not be worth the verification overhead.

I'm a bit confused regarding the grouping idea. If I understood it correctly, we only have single-draft spec decoding at the moment, so step_index is essentially the position (n-th speculated token)?
I see, got it. Yeah, then please also add the [100, 80, 45, 20] example in the comments when you add comments to node_acceptance_counts.
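For reference, a small sketch of how such a counts vector can be read, using the hypothetical [100, 80, 45, 20] example from this thread:

```python
# Index 0 counts target-model forward passes; index k counts how many times
# the k-th speculated token was accepted.
node_acceptance_counts = [100, 80, 45, 20]

forward_passes = node_acceptance_counts[0]
per_step_acceptance = [c / forward_passes for c in node_acceptance_counts[1:]]
print(per_step_acceptance)  # [0.8, 0.45, 0.2]

# Average number of tokens produced per forward pass (acceptance length).
print(sum(node_acceptance_counts) / forward_passes)  # 2.45
```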
@@ -38,7 +38,7 @@ def forward(self, x, residual):
         if residual is None:
             return x
         else:
-            return x, residual
+            return x + residual, None
Thanks for the catch!
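For readers following along, here is a minimal sketch of the corrected pass-through norm; the class name and signature are modeled on the EAGLE code touched by this hunk, but the details should be read as illustrative rather than the exact upstream implementation:

```python
from typing import Optional

import torch
from torch import nn


class DummyOutputNorm(nn.Module):
    """Pass-through replacement for the final norm in the EAGLE draft model.

    With fused residual connections, each decoder layer returns hidden states
    and the residual separately, so this stand-in must still fold the residual
    back in rather than returning the hidden states alone.
    """

    def forward(self, x: torch.Tensor, residual: Optional[torch.Tensor]):
        if residual is None:
            return x
        # The buggy version returned (x, residual), so the residual was never
        # added back before the hidden states were consumed downstream.
        return x + residual, None
```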
Hi Lily, thanks for reviewing my PR! I will address your comments as I get more clarification on some of your suggestions and concerns. The issues for those two bugs are now detailed and tracked in #14647 and #14649.
LGTM, thanks!
Please fix the doc build error.
Also need to merge from main to fix the multi-modal CI failure.
Hello @DarkLight1337, thanks for the note! The error is because the file linked in the doc is actually part of this PR. Wondering what's the best practice here? Thanks again for your time and help!
docs/source/features/spec_decode.md
@@ -162,7 +162,7 @@ A variety of speculative models of this type are available on HF hub:
 ## Speculating using EAGLE based draft models

 The following code configures vLLM to use speculative decoding where proposals are generated by
-an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](/examples/offline_inference/eagle.py).
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/eagle.py).
Suggested change:
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](<gh-file:examples/offline_inference/eagle.py>).
We can use the gh-file scheme (defined by myst_url_schemes in the config).
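For context, custom schemes like gh-file are declared through myst-parser's myst_url_schemes option in the docs configuration; the fragment below is an illustrative sketch of such an entry, not necessarily the exact one in vLLM's docs/source/conf.py:

```python
# Illustrative docs/source/conf.py fragment: expand gh-file: links to files on
# the repository's main branch.
myst_url_schemes = {
    "http": None,   # pass ordinary URLs through unchanged
    "https": None,
    "gh-file": {
        "url": "https://github.com/vllm-project/vllm/blob/main/{{path}}",
        "title": "{{path}}",
    },
}
```

With a scheme like this, the doc can simply link [here](gh-file:examples/offline_inference/eagle.py) and the build resolves it to the file on GitHub.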
@luyuzhe111
The current dummy output norm for EAGLE is incorrect. After the fix, the number of accepted tokens improved drastically from 1.4 to 1.8 with 2 speculated tokens. Essentially, the dummy RMS norm should return hidden + residual instead of hidden only.

Additionally, I enabled tracking the number of accepted tokens at the request level. Currently, I believe the only way to check acceptance length is by looking at the acceptance rate from the intermittent system logs. This makes debugging and reproduction challenging. Enabling request-level stats will be particularly helpful given that the vLLM EAGLE implementation has at least two more bugs that hurt EAGLE's acceptance length: first, the first token is not properly removed; second, the EAGLE model keeps using KV cache contaminated with its own hidden states, instead of the target model's hidden states, for accepted tokens. These factors in part explain the low performance reported in issue #9565. I'm happy to collaborate on solving these issues.
Finally, I added an example script for EAGLE where I show how the request-level acceptance length can be extracted. This also addresses issue #11126, where the community is a bit confused about how to use EAGLE.
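As a rough sketch of what that extraction can look like with the offline API (the model names, speculative-decoding arguments, and metrics access below are illustrative assumptions based on this PR's diff, not a verbatim copy of the example script):

```python
from vllm import LLM, SamplingParams

# Illustrative target/draft pair and settings; the real example may differ.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    speculative_model="yuhuili/EAGLE-llama2-chat-7B",
    num_speculative_tokens=2,
)

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=64))

for out in outputs:
    counts = out.metrics.node_acceptance_counts  # field added in this PR
    # counts[0]: target-model forward passes; counts[k]: acceptances of the
    # k-th speculated token. Acceptance length = tokens per forward pass.
    print(counts, sum(counts) / counts[0])
```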
Would appreciate your review @LiuXiaoxuanPKU!