Use Cache Hinting for fused_moe kernel #15511

wrmedford · 2025-03-26T01:40:52Z

This change is forward compatible with triton-lang/triton#6278. The benchmark below reflects the behavior once that Triton change is merged.

Before

============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  116.83
Total input tokens:                      50000
Total generated tokens:                  500000
Request throughput (req/s):              4.28
Output token throughput (tok/s):         4279.83
Total Token throughput (tok/s):          4707.81
---------------Time to First Token----------------
Mean TTFT (ms):                          161.36
Median TTFT (ms):                        133.79
P99 TTFT (ms):                           1117.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.26
Median TPOT (ms):                        72.86
P99 TPOT (ms):                           75.91
---------------Inter-token Latency----------------
Mean ITL (ms):                           72.19
Median ITL (ms):                         72.33
P99 ITL (ms):                            116.49
==================================================

After

============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  116.62
Total input tokens:                      50000
Total generated tokens:                  500000
Request throughput (req/s):              4.29
Output token throughput (tok/s):         4287.33
Total Token throughput (tok/s):          4716.07
---------------Time to First Token----------------
Mean TTFT (ms):                          141.31
Median TTFT (ms):                        131.75
P99 TTFT (ms):                           469.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.41
Median TPOT (ms):                        72.11
P99 TPOT (ms):                           74.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           71.34
Median ITL (ms):                         71.13
P99 ITL (ms):                            111.78
==================================================

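Slight improvement, mostly in latency metrics: mean TTFT drops from 161.36 ms to 141.31 ms and P99 TTFT from 1117.93 ms to 469.73 ms, while output token throughput is essentially unchanged (4279.83 vs. 4287.33 tok/s).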
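To put the two benchmark runs side by side, here is a minimal sketch that computes the relative change for a few of the reported metrics. The numbers are copied from the Before and After tables above; the dictionary keys and the `pct_change` helper are illustrative names, not part of any benchmark tooling.

```python
# Selected metrics copied from the "Before" and "After" benchmark tables.
before = {
    "mean_ttft_ms": 161.36,
    "p99_ttft_ms": 1117.93,
    "mean_tpot_ms": 72.26,
    "p99_itl_ms": 116.49,
    "output_tok_per_s": 4279.83,
}
after = {
    "mean_ttft_ms": 141.31,
    "p99_ttft_ms": 469.73,
    "mean_tpot_ms": 71.41,
    "p99_itl_ms": 111.78,
    "output_tok_per_s": 4287.33,
}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means lower)."""
    return (new - old) / old * 100.0


# Percent change per metric, rounded to two decimals.
changes = {name: round(pct_change(before[name], after[name]), 2) for name in before}

for name, delta in changes.items():
    print(f"{name:>18}: {delta:+.2f}%")
```

For latency metrics a negative value is an improvement; for throughput a positive value is. A single run per configuration does not show run-to-run variance, so small deltas should be read with care.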

The interface already exists, so this is safe to merge now.

Big thanks to @Apsu on all of the help here!

github-actions · 2025-03-26T01:41:00Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


youkaichao · 2025-03-26T14:08:02Z

@LucasWilkinson @bnellnm can you help review?

LucasWilkinson · 2025-03-26T16:45:53Z

Seems straightforward enough. This shouldn't cause any issues on non-Nvidia hardware, right? I.e., these parameters are just ignored?

bnellnm

lgtm

LucasWilkinson

LGTM assuming:

This shouldn't cause any issues on non-Nvidia hardware, right? I.e., these parameters are just ignored?

wrmedford · 2025-03-26T18:16:10Z

Seems straightforward enough. This shouldn't cause any issues on non-Nvidia hardware, right? I.e., these parameters are just ignored?

AFAIK these bindings have existed in Triton for a while, and it's up to each backend to implement them; otherwise they're ignored.

DefTruth · 2025-03-27T13:02:13Z

The latest stable version of Triton doesn't contain triton-lang/triton#6278.
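One general way to avoid passing an option that an installed library version does not accept is to inspect the callable's signature on the host side before adding the keyword. The sketch below is only an illustration of that idea in plain Python: `load_old`, `load_new`, and the `cache_hint` parameter are hypothetical stand-ins, not Triton's API, and whether such a check fits code that is JIT-compiled inside a kernel is not something this thread establishes.

```python
import inspect
from typing import Any, Callable, Dict


def accepts_kwarg(fn: Callable[..., Any], name: str) -> bool:
    """Return True if fn can be called with the keyword argument `name`."""
    try:
        params = inspect.signature(fn).parameters
    except (TypeError, ValueError):
        # No introspectable signature: be conservative and report False.
        return False
    if name in params:
        return params[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def optional_kwargs(fn: Callable[..., Any], candidates: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the candidate keywords that fn actually accepts."""
    return {k: v for k, v in candidates.items() if accepts_kwarg(fn, k)}


# Hypothetical stand-ins for an older and a newer version of the same function.
def load_old(ptr, mask=None):
    return ("old", ptr, mask)


def load_new(ptr, mask=None, cache_hint=""):
    return ("new", ptr, mask, cache_hint)


hints = {"cache_hint": "keep"}
old_kwargs = optional_kwargs(load_old, hints)  # hint is dropped
new_kwargs = optional_kwargs(load_new, hints)  # hint is kept

print(load_old(0, **old_kwargs))
print(load_new(0, **new_kwargs))
```

The trade-off of this approach is that an unsupported hint is silently skipped rather than reported, so a log message at the call site may be worth adding.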


wrmedford added 2 commits March 25, 2025 19:55

(feat) add cache hints to experts in fused_moe kernel

8a45c2b

Signed-off-by: Wes Medford <[email protected]>

(lint) fix formatting

b7a2b72

Signed-off-by: Wes Medford <[email protected]>

wrmedford force-pushed the main branch from 77a57e8 to b7a2b72 on March 26, 2025 01:55

bnellnm approved these changes Mar 26, 2025


LucasWilkinson approved these changes Mar 26, 2025


LucasWilkinson enabled auto-merge (squash) March 26, 2025 19:58

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 26, 2025

LucasWilkinson merged commit 7a88827 into vllm-project:main Mar 26, 2025
41 of 42 checks passed

Qubitium mentioned this pull request Mar 27, 2025

[Bug]: Triton JIT Compile Regression from PR 15511 #15619

Closed


DefTruth added a commit to vipshop/vllm that referenced this pull request Mar 27, 2025

Revert "Use Cache Hinting for fused_moe kernel (vllm-project#15511)"

41b95e8

This reverts commit 7a88827.

wrmedford added a commit to wrmedford/vllm that referenced this pull request Mar 27, 2025

Revert "Use Cache Hinting for fused_moe kernel (vllm-project#15511)"

fd9c09d

This reverts commit 7a88827.

wrmedford added a commit to wrmedford/vllm that referenced this pull request Mar 27, 2025

Revert "Use Cache Hinting for fused_moe kernel (vllm-project#15511)"

2192356

This reverts commit 7a88827. Signed-off-by: Wes Medford <[email protected]>

vllm-bot pushed a commit that referenced this pull request Mar 28, 2025

Revert "Use Cache Hinting for fused_moe kernel (#15511)" (#15645)

4ae17bf

Signed-off-by: Wes Medford <[email protected]>

lengrongfu pushed a commit to lengrongfu/vllm that referenced this pull request Apr 2, 2025

Use Cache Hinting for fused_moe kernel (vllm-project#15511)

c732ed0

lengrongfu pushed a commit to lengrongfu/vllm that referenced this pull request Apr 2, 2025

Revert "Use Cache Hinting for fused_moe kernel (vllm-project#15511)" (vllm-project#15645)

439f2bc

Signed-off-by: Wes Medford <[email protected]>

kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Apr 2, 2025

Use Cache Hinting for fused_moe kernel (vllm-project#15511)

9085e1d

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Apr 2, 2025

Revert "Use Cache Hinting for fused_moe kernel (vllm-project#15511)" (vllm-project#15645)

8141873

Signed-off-by: Wes Medford <[email protected]> Signed-off-by: Kyle Sayers <[email protected]>

Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025

Use Cache Hinting for fused_moe kernel (vllm-project#15511)

463c71a

Signed-off-by: xinyuxiao <[email protected]>

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025

Use Cache Hinting for fused_moe kernel (vllm-project#15511)

0a9f2e4

Signed-off-by: Louis Ulmer <[email protected]>

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025

Revert "Use Cache Hinting for fused_moe kernel (vllm-project#15511)" (vllm-project#15645)

4771a27

Signed-off-by: Wes Medford <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
