[WIP]Chunked Prefill #188

mailvijayasingh · 2025-02-13T23:19:39Z

If chunked prefill is set to True:
- Chunk and appropriately pad the tokens.
- Call prefill for each chunk.

vipannalla · 2025-02-14T06:45:49Z

benchmarks/benchmark_serving.py

@@ -518,6 +518,7 @@ async def send_request(
  """Send the request to JetStream server."""
  # Tokenize on client side following MLPerf standard.
  token_ids = tokenizer.encode(input_request.prompt)
+  print("len token_ids ", len(token_ids))


Delete this? Or use log.debug() instead of print().
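
A minimal sketch of the debug-level alternative, using only the standard `logging` module. The helper name `log_token_count` is hypothetical and exists only for illustration; in the benchmark the call would presumably sit inline in `send_request`.

```python
import logging

logger = logging.getLogger(__name__)


def log_token_count(token_ids):
  """Report the prompt length at DEBUG level instead of printing it."""
  # %-style arguments are only formatted if DEBUG logging is enabled,
  # so this costs almost nothing in a normal benchmark run.
  logger.debug("len token_ids %d", len(token_ids))
  return len(token_ids)
```

With the default logging configuration nothing is emitted; calling `logging.basicConfig(level=logging.DEBUG)` before the run makes the message visible.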

vipannalla · 2025-02-14T06:49:25Z

jetstream/engine/token_utils.py

+    # 64,
+    # 128,


We should JIT-compile the 128 bucket size.
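
For context, a sketch of why the 128 bucket matters, assuming the commented-out values belong to a sorted list of padding bucket sizes and that a prompt is padded up to the smallest bucket that fits. The function name `nearest_bucket` and the bucket lists below are illustrative assumptions, not the code in `token_utils.py`.

```python
import bisect


def nearest_bucket(length, buckets):
  """Return the smallest bucket >= length, or the largest bucket if none fits."""
  # `buckets` must be sorted in ascending order.
  idx = bisect.bisect_left(buckets, length)
  return buckets[min(idx, len(buckets) - 1)]


# With the 128 bucket present, a 100-token prompt pads to 128.
with_128 = nearest_bucket(100, [128, 256, 512])
# With it commented out, the same prompt pads to 256.
without_128 = nearest_bucket(100, [256, 512])
```

Under these assumptions, dropping the 128 bucket doubles the padded length for short prompts, which is one reason to keep it compiled.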

vipannalla · 2025-02-18T18:48:50Z

jetstream/core/orchestrator.py

+                                                                 )
+          else:
+            jax.debug.print("calling chunked_prefill for {chunk_num}", chunk_num=chunk_num)
+            prefill_result, first_token = prefill_engine.prefill(params=prefill_params | {"cache": prefill_result["cache"]},


Is "cache" supposed to represent KV cache from previous chunks so far? Can we rename it to "cache_so_far"?

vipannalla · 2025-02-18T18:52:57Z

jetstream/core/orchestrator.py

+          if prefill_result is None:
+            jax.debug.print("calling chunked_prefill for {chunk_num}", chunk_num=chunk_num)
+            prefill_result, first_token = prefill_engine.prefill(params=prefill_params,
+                                                                 padded_tokens=padded_tokens[chunk_num],
+                                                                 true_length=true_lengths[chunk_num],
+                                                                 positions=positions[chunk_num],
+                                                                 all_true_length=true_length,
+                                                                 previous_chunk=prefill_result,
+                                                                 )
+          else:
+            jax.debug.print("calling chunked_prefill for {chunk_num}", chunk_num=chunk_num)
+            prefill_result, first_token = prefill_engine.prefill(params=prefill_params | {"cache": prefill_result["cache"]},
+                                                                 padded_tokens=padded_tokens[chunk_num],
+                                                                 true_length=true_lengths[chunk_num],
+                                                                 positions=positions[chunk_num],
+                                                                 all_true_length=true_length,
+                                                                 previous_chunk=prefill_result,
+                                                                 )


You can get rid of forking to make the code more readable:

cache_so_far = {} if prefill_result is None else {"cache_so_far": prefill_result["cache"]}
prefill_result, first_token = prefill_engine.prefill(params=prefill_params | cache_so_far, ...)
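
A fuller sketch of the suggestion as a single loop, assuming `prefill_params` is a plain dict and the engine accepts the keyword arguments shown in the diff. `chunked_prefill` and `StubEngine` are hypothetical names used only to make the example runnable; the stub does not model the real engine.

```python
def chunked_prefill(prefill_engine, prefill_params, padded_tokens,
                    true_lengths, positions, true_length):
  """Call prefill once per chunk, threading the cache through without an if/else."""
  prefill_result, first_token = None, None
  for chunk_num in range(len(padded_tokens)):
    # The first chunk has no cache yet; later chunks pass the cache so far.
    cache_so_far = (
        {} if prefill_result is None
        else {"cache_so_far": prefill_result["cache"]}
    )
    prefill_result, first_token = prefill_engine.prefill(
        params=prefill_params | cache_so_far,  # dict union, Python 3.9+
        padded_tokens=padded_tokens[chunk_num],
        true_length=true_lengths[chunk_num],
        positions=positions[chunk_num],
        all_true_length=true_length,
        previous_chunk=prefill_result,
    )
  return prefill_result, first_token


class StubEngine:
  """Hypothetical stand-in that records which params keys each call received."""

  def __init__(self):
    self.seen_keys = []

  def prefill(self, *, params, padded_tokens, true_length, positions,
              all_true_length, previous_chunk):
    self.seen_keys.append(sorted(params))
    return {"cache": list(padded_tokens)}, padded_tokens[0]
```

Both branches of the original fork pass identical arguments apart from `params`, so computing `cache_so_far` up front keeps a single call site.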

vipannalla · 2025-02-18T18:59:53Z

jetstream/engine/token_utils.py

+    if total token size is 520 and chunk size is 256, 
+    the function will return 3 chunks and return tuple is as follows- 
+    [[t0,..t255][t256,..t511][t512,..t519...(padding)]], 
+    [256, 256, 7+padding], 


nit: the true lengths returned should not include padding. The last chunk holds t512..t519, which is 520 - 512 = 8 tokens, so this should be [256, 256, 8].

vipannalla · 2025-02-18T19:00:37Z

jetstream/engine/token_utils.py

+    the function will return 3 chunks and return tuple is as follows- 
+    [[t0,..t255][t256,..t511][t512,..t519...(padding)]], 
+    [256, 256, 7+padding], 
+    [[0,..255],[256,..511],[512..518..]]


nit: [512..519..] (ends at 519)
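
The docstring example can be checked with a small pure-Python sketch. It assumes every chunk is padded to `chunk_size` with a pad id of 0 and that positions run across the whole padded chunk; `chunk_and_pad` is a hypothetical name, not the function in this PR.

```python
def chunk_and_pad(tokens, chunk_size, pad_id=0):
  """Split tokens into fixed-size chunks, padding only the last one."""
  padded_chunks, true_lengths, positions = [], [], []
  for start in range(0, len(tokens), chunk_size):
    chunk = list(tokens[start:start + chunk_size])
    # True length counts real tokens only, never padding.
    true_lengths.append(len(chunk))
    # Positions continue from where the previous chunk ended.
    positions.append(list(range(start, start + chunk_size)))
    padded_chunks.append(chunk + [pad_id] * (chunk_size - len(chunk)))
  return padded_chunks, true_lengths, positions


chunks, lengths, pos = chunk_and_pad(list(range(520)), 256)
# lengths == [256, 256, 8]; the last real position is pos[2][7] == 519
```

For 520 tokens and a chunk size of 256, the third chunk contains tokens 512 through 519, so its true length is 8 and its last real position is 519.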

vipannalla · 2025-02-18T19:40:10Z

jetstream/core/orchestrator.py

+          t_l_array = jnp.expand_dims(jnp.arange(0, chunk_num*prefill_engine.chunk_size + true_lengths[chunk_num]), 0)
+          prefill_result['t_l_array'] =  t_l_array


Isn't t_l_array the same as the positions array? Where is this used?
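
To make the comparison concrete, here is a pure-Python sketch in which `range` stands in for `jnp.arange` and the batch dimension added by `expand_dims` is omitted. It assumes `positions[chunk_num]` covers only the current chunk, as the docstring example suggests; both function names are hypothetical.

```python
def cumulative_positions(chunk_num, chunk_size, true_len):
  """Mirror of the arange used for t_l_array: 0 up to the end of this chunk."""
  return list(range(0, chunk_num * chunk_size + true_len))


def chunk_positions(chunk_num, chunk_size, true_len):
  """Positions of the real tokens in the current chunk only (assumed layout)."""
  start = chunk_num * chunk_size
  return list(range(start, start + true_len))


first_same = cumulative_positions(0, 256, 256) == chunk_positions(0, 256, 256)
last_cumulative = cumulative_positions(2, 256, 8)
last_chunk_only = chunk_positions(2, 256, 8)
```

Under that assumption the two agree only for the first chunk. For later chunks `t_l_array` spans every position seen so far (520 entries for the last chunk of the docstring example), while the per-chunk positions cover just the current chunk. The sketch does not show where `t_l_array` is consumed.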

Chunked Prefill

8d9ae19

mailvijayasingh requested a review from vipannalla as a code owner February 13, 2025 23:19

mailvijayasingh changed the title ~~Chunked Prefill~~ [WIP]Chunked Prefill Feb 13, 2025

vipannalla reviewed Feb 18, 2025
