Cherry-pick of "Selective merged prefill #643" #893
base: habana_main
Conversation
8356d0f to 4f64e8a
@@ -227,6 +228,33 @@ def find_rope_layer(parent, path):
    return path_to_rope


class HPUBucketingContextWithMergedPrefill(HPUBucketingContext):
I think this should be a part of https://github.com/HabanaAI/vllm-hpu-extension/blob/main/vllm_hpu_extension/bucketing.py
+1
            self.max_num_prefill_seqs,
            self.block_size,
            self.max_num_batched_tokens)
        self.enable_merged_prefill = os.environ.get('VLLM_MERGED_PREFILL',
It would be good to add this flag to the README with a small explanation.
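For illustration, a minimal sketch of how such a flag is typically read and could be described; the default value, the accepted truthy strings, and the described semantics are assumptions based on the PR title, not taken from this change:

import os

# VLLM_MERGED_PREFILL: when enabled, multiple prompt sequences are merged
# into a single prefill pass on HPU (assumed semantics; disabled by default
# in this sketch).
enable_merged_prefill = os.environ.get('VLLM_MERGED_PREFILL',
                                       'false').lower() in ('1', 'true')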
def prompt_fsdpa(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attn_bias: Optional[torch.Tensor] = None,
    p: float = 0.0,
    scale: Optional[float] = None,
    matmul_qk_op=torch.matmul,
    softmax_op=torch.softmax,
    matmul_av_op=torch.matmul,
    valid_seq_lengths: Optional[torch.Tensor] = None,
    fsdpa_op=None,
) -> torch.Tensor:
    query = query.transpose(1, 2)
    key = key.transpose(1, 2)
    value = value.transpose(1, 2)
    softmax_mode = 'fast'
    recompute_mode = True
    attn_weights = fsdpa_op(query, key, value, attn_bias, 0.0, False, scale,
                            softmax_mode, recompute_mode, None, 'right')
    attn_weights = attn_weights.transpose(1, 2)
    return attn_weights
This should be in hpu-extension/ops.py where all our attn implementations are. What's the difference between this and other implementations? Is it only because is_causal is False?
Yes, the function is duplicated so that it can call fsdpa_op with is_causal set to False.
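One way to avoid the duplication (a sketch only, assuming fsdpa_op keeps the positional signature used in the call above) is to thread is_causal through a single helper instead of copying the function; the name prompt_fsdpa_single is hypothetical:

from typing import Optional

import torch


def prompt_fsdpa_single(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attn_bias: Optional[torch.Tensor] = None,
    scale: Optional[float] = None,
    fsdpa_op=None,
    is_causal: bool = False,
) -> torch.Tensor:
    # Same layout handling as the function above; only is_causal becomes
    # a parameter instead of a hard-coded False.
    query = query.transpose(1, 2)
    key = key.transpose(1, 2)
    value = value.transpose(1, 2)
    attn_weights = fsdpa_op(query, key, value, attn_bias, 0.0, is_causal,
                            scale, 'fast', True, None, 'right')
    return attn_weights.transpose(1, 2)

The causal and non-causal prompt paths could then share this single helper.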
        origin_enable_merged_prefill = self.enable_merged_prefill
        self.enable_merged_prefill = False
        self.warmup_scenario(max_batch_size, max_seq_len, True, kv_caches,
                             False, True)
        self.enable_merged_prefill = origin_enable_merged_prefill
This is a code smell. Could we do it differently?
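One possible alternative (a sketch, not part of this PR) is to wrap the temporary toggle in a context manager so warmup cannot leave the flag in an inconsistent state; _merged_prefill_disabled is a hypothetical helper name:

from contextlib import contextmanager


@contextmanager
def _merged_prefill_disabled(runner):
    # Temporarily force non-merged prefill and restore the original value
    # even if the warmup scenario raises.
    original = runner.enable_merged_prefill
    runner.enable_merged_prefill = False
    try:
        yield
    finally:
        runner.enable_merged_prefill = original

# Hypothetical usage inside warmup:
# with _merged_prefill_disabled(self):
#     self.warmup_scenario(max_batch_size, max_seq_len, True, kv_caches,
#                          False, True)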
        if computed_block_nums is not None and len(
                computed_block_nums) > 0 and self.sliding_window is None:
            # Prefix is not supported with sliding_window
            context_len = len(computed_block_nums) * self.block_size
            prompt_tokens = prompt_tokens[context_len:]
            prefix_block_tables.append(computed_block_nums)
        elif self.scheduler_config.chunked_prefill_enabled:
            if seq_group_metadata.block_tables is not None:
                # Prefill has chunked before.
                block_table = seq_group_metadata.block_tables[seq_id]
                prefix_block_tables.append(block_table)
            else:
                # The first prefill.
                prefix_block_tables.append([])
        else:
            prefix_block_tables.append([])
            # Right now, prefill start is always 0. However, this
            # assumption can be changed once chunked prefill is introduced.
            assert context_len == 0
do we support prefix caching in merged prefill?
No, I didn't enable that. Yang and I discussed how to add context_length as well; we would need to rework the attn_mask if we do so.
        if self.sliding_window is not None:
            assert context_len == 0, (
                "Prefix caching is currently not supported with "
                "sliding window attention")
            start_idx = max(0, seq_len - self.sliding_window)
do we support sliding window attention in merged prefill?
no
                                     dtype=torch.long,
                                     device='cpu')

        max_prefill_bs = int(os.environ.get('VLLM_PROMPT_BS_BUCKET_MAX', '8'))
We already have a parameter in scheduler_config for that: max_num_prefill_seqs.
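A sketch of what the reviewer is suggesting, assuming the scheduler config is available on the runner (not verified against this PR); _max_prefill_bs is a hypothetical helper:

def _max_prefill_bs(runner) -> int:
    # Reuse the existing scheduler_config parameter instead of reading a
    # separate VLLM_PROMPT_BS_BUCKET_MAX environment variable.
    return runner.scheduler_config.max_num_prefill_seqs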
        if (self.scheduler_config is not None
                and self.scheduler_config.chunked_prefill_enabled
                and not (computed_block_nums is None
                         or computed_block_nums == [])):
            raise RuntimeError(
                "chunked prefill cannot be used with prefix caching "
                "now.")
we don't support chunked prefill at all
That is correct, although there is this PR: HabanaAI/vllm-hpu-extension#94.
        sampling_metadata = None
        sampling_metadata.selected_token_indices = \
            torch.cat((sampling_metadata.selected_token_indices, paddings),
                      dim=0)
Can you add the else portion? For the pooler, there is no sampling_metadata.
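A sketch of the guard being asked for, assuming the padding should simply be skipped when there is no sampling metadata (the pooler path); this is not the PR's final code:

if sampling_metadata is not None:
    sampling_metadata.selected_token_indices = \
        torch.cat((sampling_metadata.selected_token_indices, paddings),
                  dim=0)
else:
    # Pooling models carry no sampling_metadata, so there are no selected
    # token indices to pad here.
    pass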
Cherry-pick of #643