
Misc. bug: The KV cache is sometimes truncated incorrectly when making v1/chat/completions API calls #11970

Open
vnicolici opened this issue Feb 20, 2025 · 11 comments
Labels
bug Something isn't working

Comments

@vnicolici

Name and Version

.\llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4743 (d07c621)
built with MSVC 19.29.30158.0 for

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

> .\llama-server.exe -m .\models\unsloth\DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --verbose-prompt --dump-kv-cache --log-timestamps --log-prefix --verbose --alias 'DeepSeek-R1-UD-IQ1_S' --log-file DeepSeek-R1-UD-IQ1_S.log

Problem description & steps to reproduce

When using the llama-server and its Web UI, sometimes parts of the KV cache are truncated when they shouldn't be. Steps to reproduce:

  1. Start llama-server with a command such as:
.\llama-server.exe -m .\models\unsloth\DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --verbose-prompt --dump-kv-cache --log-timestamps --log-prefix --verbose --alias 'DeepSeek-R1-UD-IQ1_S' --log-file DeepSeek-R1-UD-IQ1_S.log

This is the 1.58-bit quantized version of the DeepSeek-R1 model by unsloth. I've been able to reproduce the issue with the 2.22-bit version too.
However, I have NOT been able to reproduce it with DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf or Qwen2.5-7B-Instruct-1M-Q4_K_M.gguf.

  2. Open the Web UI, and disable "Exclude thought process when sending request to API (Recommended for DeepSeek-R1)"

In this way, the prompts sent should match the KV cache entirely within the same conversation, since the thinking that is included in the cache won't be excluded from the prompt.

Side note: In my opinion, the UI should include the thought process in the prompts by default, as in my experience the quality of long conversations suffers when the thinking is excluded from the prompts. Also, excluding the thinking means the cache has to be recomputed starting from the end of the previous user input every time the user enters something new in the chat, which slows down the assistant replies.

Basically, this causes long pauses after each new user input before the assistant starts generating output, because the server has to reprocess the previous assistant reply (minus the thinking) as part of the prompt. When the previous reply is long, even without the thinking, this can take minutes, or even tens of minutes in extreme cases. I understand the advantage of removing the thinking, since it lets a long conversation fit in a smaller context, but I'm not sure it outweighs the disadvantages.

  3. Start a new conversation from the Web UI and enter a user prompt that is likely to cause significant amounts of assistant output, including "thinking", for example:
Tell me a romantic story, please.
  4. Wait for the reply to be generated, and check the log to see how much information has accumulated in the cache. An example from my latest test:
8.45.064.978 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 1185, n_cache_tokens = 1185, truncated = 0

So, in this case, the cache contained 1185 tokens after the assistant replied to my initial prompt.

  5. Add some new user input to the conversation. This time it doesn't necessarily need to generate a lot of output or cause a lot of thinking. For example:
Thank you, that's all.
  6. Check the log again to see how much of the cache has been truncated; you will find something like this:
12.30.856.271 I slot update_slots: id  0 | task 1170 | kv cache rm [488, end)

This means that the cache from position 488 to position 1185 has been discarded, for some reason.

In my opinion, this shouldn't happen; the server should keep the entire content of the cache and not remove anything, since the new prompt is a continuation of the same conversation.

During my test, I tried identifying exactly what was previously in the cache at position 488, and it was a word in a sentence towards the end of the thinking, but it doesn't seem special in any way. Just the word "vivid" before the end of a sentence, and that sentence wasn't even the last sentence in the thinking section of the reply:

4.11.215.220 D slot process_toke: id  0 | task 0 | n_decoded = 470, n_remaining = -1, next token:   850 ' more'
...
4.11.215.232 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 487, n_cache_tokens = 487, truncated = 0
...
4.11.578.617 D slot process_toke: id  0 | task 0 | n_decoded = 471, n_remaining = -1, next token: 33949 ' vivid'
...
4.11.578.630 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 488, n_cache_tokens = 488, truncated = 0
...
4.11.934.032 D slot process_toke: id  0 | task 0 | n_decoded = 472, n_remaining = -1, next token:    16 '.'
...
4.11.934.047 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 489, n_cache_tokens = 489, truncated = 0

I've even coded my own command-line API client in Go, and I was still able to replicate the issue. So it doesn't seem to be a bug in the Web UI, but an issue with the /v1/chat/completions API itself.

I have NOT been able to replicate this using llama-cli.exe; it works properly, without discarding any parts of the cache during such conversations.

Currently I'm forced to use the CLI, because otherwise a 2-hour conversation with DeepSeek can easily turn into a 3-4 hour one due to the caching issues.

I attached the log from my latest test.

DeepSeek-R1-UD-IQ1_S.zip

First Bad Commit

No response

Relevant log output

@ggerganov
Member

I suspect that for some reason the text in the new request tokenizes slightly differently around the "vivid" word compared to what was generated.

@vnicolici
Author

OK, if there is anything else you want me to test to find the cause, I'm available.

@ggerganov
Member

In your example, what was the token after the . (id = 16)?

@vnicolici
Author

Two new lines ('\n\n'), between two sentences:

4.12.294.383 D slot process_toke: id  0 | task 0 | n_decoded = 473, n_remaining = -1, next token:   271 '

'

@vnicolici
Author

I just repeated my last test. It was "skip" instead of "vivid" this time, but again followed by a dot, 16, then by 271 (two new lines).

This time it wasn't even during the thinking, and it was the second instance of the 271 token. However, it was the first instance of a 16 followed by a 271.

9.23.600.369 D slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 4096, n_past = 1129, n_cache_tokens = 1129, truncated = 0
....
10.01.107.702 I slot update_slots: id  0 | task 1114 | kv cache rm [539, end)
...
5.35.683.014 D slot process_toke: id  0 | task 0 | n_decoded = 521, n_remaining = -1, next token:  4082 ' heart'
...
5.36.040.830 D slot process_toke: id  0 | task 0 | n_decoded = 522, n_remaining = -1, next token: 21429 ' skip'
...
5.36.407.388 D slot process_toke: id  0 | task 0 | n_decoded = 523, n_remaining = -1, next token:    16 '.'
...
5.36.764.273 D slot process_toke: id  0 | task 0 | n_decoded = 524, n_remaining = -1, next token:   271 '

'

@ggerganov
Member

Yes, as I expected it tokenizes differently on the way back to the server:

./bin/llama-tokenize -m r1.gguf -p " more vivid."

init: model is vocab-only -- no computation will be performed
     0 -> '<|begin▁of▁sentence|>'
   850 -> ' more'
 33949 -> ' vivid'
    16 -> '.'
./bin/llama-tokenize -m r1.gguf -p " more vivid.\n\n"

init: model is vocab-only -- no computation will be performed
     0 -> '<|begin▁of▁sentence|>'
   850 -> ' more'
 33949 -> ' vivid'
   339 -> '.

'

This means that either we have a bug in the pre-processor regexes of the R1 tokenizer, or this is simply a limitation of the tokenizer. If the latter, then the client would have to start storing the raw token ids along with the text and send the ids for the new requests.
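For illustration, here is a minimal self-contained sketch of why this happens with any longest-match style tokenizer. The vocabulary below is a tiny made-up subset (token ids taken from the logs above) and the greedy matcher is only a stand-in, not the actual R1/llama.cpp tokenizer, but the effect is the same: generation emitted '.' (16) followed by '\n\n' (271), while retokenizing the text prefers the longer merged piece '.\n\n' (339).

// Toy illustration only: a greedy longest-match tokenizer over a made-up vocabulary.
// The real R1 tokenizer is BPE-based, but the effect is the same: the text produced by
// the tokens [' vivid', '.', '\n\n'] retokenizes to [' vivid', '.\n\n'].
#include <cstdio>
#include <map>
#include <string>
#include <vector>

static const std::map<std::string, int> vocab = {
    {" vivid", 33949}, {".", 16}, {"\n\n", 271}, {".\n\n", 339},
};

static std::vector<int> tokenize_greedy(const std::string & text) {
    std::vector<int> out;
    size_t pos = 0;
    while (pos < text.size()) {
        size_t best_len = 0;
        int    best_id  = -1;
        for (const auto & [piece, id] : vocab) {          // pick the longest vocab piece matching here
            if (piece.size() > best_len && text.compare(pos, piece.size(), piece) == 0) {
                best_len = piece.size();
                best_id  = id;
            }
        }
        if (best_id < 0) { return {}; }                   // unknown text, not relevant for this sketch
        out.push_back(best_id);
        pos += best_len;
    }
    return out;
}

int main() {
    const std::vector<int> generated   = {33949, 16, 271};      // what the model emitted
    const std::vector<int> retokenized = tokenize_greedy(" vivid.\n\n");
    // retokenized is {33949, 339}: '.' and '\n\n' collapse into the single '.\n\n' token,
    // so the common prefix with the cache ends right before token id 16.
    std::printf("generated %zu tokens, retokenized %zu tokens\n", generated.size(), retokenized.size());
}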

@vnicolici
Author

If the latter, then the client would have to start storing the raw token ids along with the text and send the ids for the new requests.

Does the /v1/chat/completions API support this at the moment? I mean, returning a list of token ids, not just the tokens, and accepting a list of token ids instead of regular text content as the input? Also, is there some documentation for this API somewhere? If there is, I have trouble finding it. I only found links to the OpenAI documentation. I know they are compatible, but I assume there are differences.

@vnicolici
Author

After reading the READMEs, the source code, and experimenting a bit, it seems receiving/sending arrays of token IDs instead of strings is not supported by the API.

I managed to get the generated token IDs in the API response under "__verbose" / "tokens" using "return_tokens":true, but it only works with the server started with --verbose and it doesn't work if I enable streaming with "stream": true.

Then I tried sending the token IDs back with the next request, and the server didn't accept such a request at all; I got Failed to parse messages: Missing content part type: 128798 for this request:

{
  "messages":[
    {"content":"You are a helpful assistant.","role":"system"},
    {"content":"Hello.","role":"user"},
    {"content":[128798,271,128799,271,19923,3,1730,588,342,8233,440,4316,33,53769,235],"role":"assistant"},
    {"content":"Who are you?","role":"user"}],
  "return_tokens":true
}

Looking at the source code, aside from a simple string, the content can be an array of objects; each object must have a type attribute equal to "text", in which case the object is expected to have another attribute, also named text (i.e. [{"type": "text", "text": "..."}]). Looking further into the code, the purpose of that seems to be sending multiple pieces of text as an array, so it has nothing to do with tokens.
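For reference, here is a minimal self-contained sketch (using nlohmann::json, which server.cpp already depends on) of the content handling as I read it from the source. This is only my approximation, not the actual server code, but it shows why a bare token id such as 128798 produces "Missing content part type":

// Sketch only: approximates the content parsing described above, not the real server code.
#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>

using json = nlohmann::json;

// Flatten a chat message "content" field into plain text.
static std::string content_to_text(const json & content) {
    if (content.is_string()) {
        return content.get<std::string>();      // the common case: a plain string
    }
    std::string text;
    for (const auto & part : content) {         // array form: [{"type": "text", "text": "..."}, ...]
        if (!part.is_object() || !part.contains("type")) {
            // a raw token id like 128798 lands here, hence "Missing content part type: 128798"
            throw std::runtime_error("Missing content part type: " + part.dump());
        }
        if (part.at("type") == "text") {
            text += part.at("text").get<std::string>();
        }
    }
    return text;
}

int main() {
    const json ok  = json::array({ { {"type", "text"}, {"text", "Hello."} } });
    const json bad = json::array({ 128798, 271 });
    content_to_text(ok);            // accepted
    try {
        content_to_text(bad);       // rejected: the parts are numbers, not {"type": "text", ...} objects
    } catch (const std::exception &) {
    }
}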

@ggerganov
Member

I would first check if the tokenizer works correctly by comparing it to a reference implementation.

@vnicolici
Author

While I have no experience with LLM tokenizers, my feeling is there is nothing wrong with the tokenizer. The problematic text generated initially came from 2 tokens, the dot as the first token and the two new lines as a single separate token. And when you tried to tokenize it, it got converted to a single token consisting of the dot and the two new lines. That seems normal to me.

For example, if for some strange reason a model generates the text "abc" as 3 different tokens, one for each character, and there is also a token that matches the entire sequence, I would expect the tokenizer to return that single token, for "abc", not 3 tokens for "a", "b" and "c". If the tokenizer didn't behave like that, everything would always be tokenized with a token for each character, which doesn't make any sense.

I don't think it's impossible, in general, for a model to generate 2 (or more) non-equal sequences of tokens that represent the same text, depending on the seed, temperature and other factors. In that case, if you try to tokenize the text, at best it would match one of the two sequences, so it would behave "incorrectly" in 50% of the cases for such sequences, and there would be no way to "fix" that.

In our particular case, involving new lines, maybe handling new lines differently in the tokenizer might solve the issue. But I'm not sure if that would be an actual fix, or just a workaround for this particular situation. And there is also the risk that fixing this case might introduce issues in other cases.

So, I have an idea that might make the differences between the initial generation and the tokenization irrelevant when determining how much of the cache to keep, but I'm not sure how feasible it would be to implement.

Basically, if I understand correctly, right now the cache is an array of token IDs, the received prompt is converted (tokenized) into another array of token IDs, and the two arrays are compared item by item until either a mismatch is found or one of the arrays ends; the cache is then truncated at that position. (There is also the similarity check for the situation when there are multiple slots, i.e. the minimum similarity required to match a slot, but for simplicity I'll ignore it for now.)
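To make that concrete, here is a minimal self-contained sketch of that comparison as I understand it (plain token id vectors, not the actual llama-server code). With the '.' + '\n\n' vs '.\n\n' difference from above, the common prefix ends right before the old '.' token, which is why the log shows kv cache rm [488, end):

// Sketch of the current behavior as I understand it (not the actual llama-server code):
// compare the cached token ids with the retokenized prompt and truncate at the first mismatch.
#include <cstdio>
#include <vector>

static size_t common_prefix_len(const std::vector<int> & cache, const std::vector<int> & prompt) {
    size_t n = 0;
    while (n < cache.size() && n < prompt.size() && cache[n] == prompt[n]) {
        n++;
    }
    return n;
}

int main() {
    // Shortened example using token ids from the logs above:
    // cache:  ' more' ' vivid' '.' '\n\n' ...    prompt retokenized:  ' more' ' vivid' '.\n\n' ...
    const std::vector<int> cache_tokens  = {850, 33949, 16, 271, 43};
    const std::vector<int> prompt_tokens = {850, 33949, 339, 43};

    const size_t keep = common_prefix_len(cache_tokens, prompt_tokens);
    // Everything from position `keep` onward is discarded, which in the full conversation
    // corresponds to the "kv cache rm [488, end)" log line.
    std::printf("keeping %zu cached tokens, discarding the rest\n", keep);
}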

A possible way to fix this would be to do it the other way around: don't tokenize the input prompt right away, just leave it as a string. Then convert the cached token IDs to text, one by one, and for each one check whether the input prompt text matches at the current position, advancing in the input prompt by the token's length if it does. When a difference is found between the text corresponding to a token ID from the cache and the text of the input prompt, the cache can be truncated at that position.

The rest of the prompt string, starting at the position that didn't match the cache, can then be tokenized (if needed) and processed. I'm not very familiar with how this works, but I'm guessing you would need to tokenize at least the "new" part of the prompt (the part that doesn't match the cache) to be able to process it and then generate the assistant reply.
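A minimal sketch of that idea (self-contained, with the detokenizer passed in as a stand-in callable; in llama.cpp that role would be played by something like common_token_to_piece, so this is just the shape of the algorithm, not a patch):

// Sketch of the proposed text-level comparison, not llama.cpp code:
// find how many cached tokens are covered verbatim by the start of the prompt string.
#include <functional>
#include <string>
#include <vector>

struct cache_match {
    size_t n_tokens = 0;   // cached tokens covered verbatim by the prompt text
    size_t n_chars  = 0;   // prompt characters they cover
};

static cache_match match_cache_against_text(
        const std::vector<int>                & cache_tokens,
        const std::string                     & prompt,
        const std::function<std::string(int)> & detok) {   // token id -> text piece
    cache_match m;
    for (const int token : cache_tokens) {
        const std::string piece = detok(token);
        if (m.n_chars + piece.size() <= prompt.size() &&
            prompt.compare(m.n_chars, piece.size(), piece) == 0) {
            m.n_chars  += piece.size();
            m.n_tokens += 1;
        } else {
            break;   // first textual mismatch: the cache would be truncated here
        }
    }
    // Cache positions [m.n_tokens, end) would be dropped; only prompt.substr(m.n_chars)
    // still needs to be tokenized and evaluated as new input.
    return m;
}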

@vnicolici
Author

After further testing yesterday and today, I was able to confirm my hypothesis that a model can randomly generate two different sequences of tokens for the same text. In particular, during my tests the DeepSeek-R1-UD-IQ1_S model generated the following token sequences, all of which produce text containing .\n\n:

	"Problematic":

4885 ' element'		16 '.'		271 '\n\n'		43 'I'
18249 ' ending'		16 '.'		271 '\n\n'		43 'I'
20820 ' satisfied'	16 '.'		271 '\n\n'		86020 'Avoid'
21429 ' skip'		16 '.'		271 '\n\n'		428 '“'
31760 ' secrets'	16 '.'		271 '\n\n'		111116 'Across'
33949 ' vivid'		16 '.'		271 '\n\n'		103633 'Alright'

	"OK":

396 ' that'			339 '.\n\n'		43 'I'
850 ' more'			339 '.\n\n'		91678 'Dial'
1354 ' return'			339 '.\n\n'		1124 'In'
1060 ' over'			339 '.\n\n'		428 '“'
1066 ' them'			339 '.\n\n'		6111 'One'
1722 ' way'			339 '.\n\n'		6111 'One'
1894 ' good'			339 '.\n\n'		671 'The'
2680 ' home'			339 '.\n\n'		2991 'As'
2680 ' home'			339 '.\n\n'		5013 '---'
2727 ' too'			339 '.\n\n'		428 '“'
2783 ' art'			339 '.\n\n'		117260 'Conflict'
2934 ' development'		339 '.\n\n'		43 'I'
2939 'ries'			339 '.\n\n'		6111 'One'
3672 ' together'		339 '.\n\n'		5718 'Let'
3672 ' together'		339 '.\n\n'		72937 'Winter'
4124 'ening'			339 '.\n\n'		428 '“'
4316 ' today'			339 '.\n\n'		3158 'He'
4468 ' paper'			339 '.\n\n'		6737 'She'
4844 ' needed'			339 '.\n\n'		6111 'One'
4885 ' element'			339 '.\n\n'		43 'I'
5446 ' friends'			339 '.\n\n'		52925 'Days'
5619 ' himself'			339 '.\n\n'		6737 'She'
6006 ' began'			339 '.\n\n'		36954 '---\n\n'
7010 ' letter'			339 '.\n\n'		12 '*'
7169 ' pages'			339 '.\n\n'		6111 'One'
7223 ' song'			339 '.\n\n'		4089 'On'
7223 ' song'			339 '.\n\n'		6111 'One'
8147 ' cold'			339 '.\n\n'		12808 'Then'
8369 ' chance'			339 '.\n\n'		13920 'End'
11219 ' scene'			339 '.\n\n'		100296 'Characters'
11350 ' sweet'			339 '.\n\n'		671 'The'
12013 ' interactions'		339 '.\n\n'		4117 'How'
12361 ' resolution'		339 '.\n\n'		32064 'Maybe'
12644 ' possibility'		339 '.\n\n'		428 '“'
13801 ' atmosphere'		339 '.\n\n'		91678 'Dial'
14857 ' engagement'		339 '.\n\n'		82761 'Possible'
16205 ' rose'			339 '.\n\n'		428 '“'
16545 'ritten'			339 '.\n\n'		7054 'From'
18249 ' ending'			339 '.\n\n'		43 'I'
19782 ' possibilities'		339 '.\n\n'		77653 'Night'
20760 ' timing'			339 '.\n\n'		72243 'Plot'
20955 ' replied'		339 '.\n\n'		8474 'They'
22524 ' ignore'			339 '.\n\n'		36954 '---\n\n'
26696 ' gaze'			339 '.\n\n'		100520 'Emma'
29862 ' obstacles'		339 '.\n\n'		43 'I'
34508 ' pause'			339 '.\n\n'		428 '“'
37466 ' whispered'		339 '.\n\n'		428 '“'
39636 ' lover'			339 '.\n\n'		2991 'As'
48032 ' tiles'			339 '.\n\n'		6111 'One'
56606 ' bites'			339 '.\n\n'		52925 'Days'
56616 ' faded'			339 '.\n\n'		2991 'As'
58588 ' crafting'		339 '.\n\n'		8229 'After'
59571 'ophone'			339 '.\n\n'		6111 'One'
61323 ' whisper'		339 '.\n\n'		32374 'Jul'
61375 ' endured'		339 '.\n\n'		43 'I'
62179 ' confession'		339 '.\n\n'		13920 'End'
83650 'ishly'			339 '.\n\n'		46 'L'
95042 ' bakery'			339 '.\n\n'		57935 'Years'
111555 ' baker'			339 '.\n\n'		41610 'Their'

In particular, if we look at the first two items from the first list, we will find similar tokens (that produce the same output text) in the second list:

4885 ' element'		16 '.'		271 '\n\n'		43 'I'
18249 ' ending'		16 '.'		271 '\n\n'		43 'I'

vs

4885 ' element'			339 '.\n\n'		43 'I'
18249 ' ending'			339 '.\n\n'		43 'I'

All 4 sequences have been generated by the same model, so that confirms my hypothesis from yesterday, that it's impossible to reconstruct the exact sequence of tokens from the cache based on just the prompt text.

I think this also shows that there is nothing wrong with the tokenizer, and any attempt to "fix" the tokenizer would make it worse: it would then match the currently "problematic" sequences, but it would fail for the ones that are now in the "OK" list, which is much larger (6 cases in the "problematic" list vs 59 in the "OK" list).

Yesterday, I also installed a development environment and tried to assess the feasibility of my earlier proposal (comparing the cache and the prompt at the text level, instead of the token level, when deciding how much of the cache to keep). After a few hours it became obvious that it wouldn't be practical: it would require far too many changes to the existing code, basically a rewrite.

However, I had a new idea: a hybrid tokenizer. Basically, instead of tokenizing the prompt just using the vocabulary, the tokenization would use the slot caches as well. First, it will tokenize the part of the prompt that matches the slot cache using the exact tokens from the cache, then, for the remainder of the prompt that doesn't match the cache (the new part of the prompt), it will tokenize it as usual based on just the vocabulary. If more than one slot matches the prompt, it will use the cache that matches the most characters from the prompt.

Today I started working on this idea, and I was able to make it work. This is the diff for my changes to server.cpp:

diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index 2306dc26..97332894 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -4,6 +4,7 @@
 #include "common.h"
 #include "json-schema-to-grammar.h"
 #include "llama.h"
+#include "llama-vocab.h"
 #include "log.h"
 #include "sampling.h"
 #include "speculative.h"
@@ -3841,7 +3842,65 @@ int main(int argc, char ** argv) {
             // TODO: this log can become very long, put it behind a flag or think about a more compact format
             //SRV_DBG("Prompt: %s\n", prompt.is_string() ? prompt.get<std::string>().c_str() : prompt.dump(2).c_str());

-            std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, true, true);
+            std::vector<llama_tokens> tokenized_prompts; // start of new tokenization code based on caches; it may need optimizations and bug fixes
+            if (prompt.is_string()) { // attempt tokenization based on the slot token caches first, only for prompts consisting of a single string
+                llama_tokens cache_based_tokenization;
+                std::string prompt_string = prompt.get<std::string>();
+                size_t max_prompt_match_in_chars = 0;
+
+                SRV_DBG("Attempting slot cache based tokenization of the prompt, total prompt length %zu characters.\n", prompt_string.size());
+                for (size_t slot_index = 0; slot_index < ctx_server.slots.size(); slot_index++) {
+                    size_t  prompt_index = 0;
+                    size_t cache_index  = 0;
+                    llama_tokens partially_tokenized_prompt;
+                    llama_tokens cache_tokens = ctx_server.slots[slot_index].cache_tokens; // accessing the caches like this might be unsafe
+
+                    if (cache_tokens.size() > 0) {
+                        SRV_DBG("Slot %zu has %zu cached tokens, attempting prompt tokenization based on them.\n", slot_index, cache_tokens.size());
+                        for (cache_index = 0; cache_index < cache_tokens.size() && prompt_index < prompt_string.size(); cache_index++) {
+                            llama_token       token        = cache_tokens[cache_index];
+                            const std::string token_string = common_token_to_piece(ctx_server.vocab, token, true);
+                            size_t token_size = token_string.size();
+
+                            if (prompt_index + token_size <= prompt_string.size() && prompt_string.compare(prompt_index, token_size, token_string) == 0) {
+                                prompt_index += token_size;
+                                partially_tokenized_prompt.push_back(token);
+                            } else if (cache_index == 0) { // the first token from the cache doesn't have to be in the prompt, as it might be a BOS token, so just add it. This might cause issues.
+                                partially_tokenized_prompt.push_back(token);
+                            } else {
+                                break;
+                            }
+                        }
+
+                        if (prompt_index > max_prompt_match_in_chars) { // the tokenization based on this slot matches more characters than the previous best match
+                            max_prompt_match_in_chars = prompt_index;
+                            cache_based_tokenization  = partially_tokenized_prompt;
+                        }
+                    }
+                }
+
+                if (max_prompt_match_in_chars > 0) {  // if some of the prompt was tokenized based on the slot caches
+                    std::string              remaining_string = prompt_string.substr(max_prompt_match_in_chars);
+                    std::vector<llama_token> remaining_prompt_tokens = common_tokenize(ctx_server.vocab, remaining_string, true, true); // tokenize the rest of the prompt normally
+
+                    SRV_DBG("The slot caches based tokenization has produced %zu tokens and the regular tokenization an additional %zu tokens for a total of %zu.\n",
+                        cache_based_tokenization.size(), remaining_prompt_tokens.size(), cache_based_tokenization.size() + remaining_prompt_tokens.size());
+
+                    // concatenate the additional tokens to the cached tokens, but skip the additional BOS, as we don't need one in the middle of the tokens. This might cause issues.
+                    if (remaining_prompt_tokens.size() > 1) {
+                        cache_based_tokenization.insert(cache_based_tokenization.end(), remaining_prompt_tokens.begin() + 1, remaining_prompt_tokens.end());
+                    }
+
+                    tokenized_prompts.push_back(cache_based_tokenization);
+                } else {
+                    SRV_DBG("Partial tokenization of the %zu character long prompt based on slot caches was not possible.\n", prompt_string.size());
+                }
+            }
+
+            if (tokenized_prompts.empty()) { // if the slot token cache based tokenization was not possible, tokenize the prompt normally
+                tokenized_prompts = tokenize_input_prompts(ctx_server.vocab, prompt, true, true);
+            } // end of new tokenization code based on caches
+
             tasks.reserve(tokenized_prompts.size());
             for (size_t i = 0; i < tokenized_prompts.size(); i++) {
                 server_task task = server_task(type);

Right now, the handle_completions_impl function calls tokenize_input_prompts to tokenize the input prompt. The new logic replaces that single call.

With these changes, I haven't been able to reproduce the problem anymore.

I'll now list the limitations and potential issues for my logic:

  1. My new logic works only for a prompt consisting of a single JSON string. For other types of prompts, it doesn't alter the existing logic in any way, so it won't fix similar issues for other types of prompts.
  2. The new logic accesses the slot caches through ctx_server.slots[].cache_tokens. I'm not entirely sure, but that might not be safe, especially when some of the slots are busy processing something.
  3. The new logic tries to match the prompts vs the caches starting with the beginning of each cache. If the matching tokens start somewhere in the middle of the cache for a slot, it will not use the cache of that slot for tokenization. While for my use case that is not an issue, in other situations my logic might still not fix the problem.
  4. The chat prompt comes to the handle_completions_impl function with the chat template applied, but it is missing the BOS token. As such, when comparing the prompt with the caches, which contain the BOS token, I have to ignore that initial mismatch. In theory, that may cause issues in some edge cases, and result in an incorrect tokenization at the start of the prompt. Currently, I do not verify that the first token from the cache is indeed a BOS token before proceeding further with using the cache for tokenization, but it might be needed to handle edge cases.
  5. I use the common_tokenize function to tokenize the part of the prompt that doesn't match the cache. From what I have seen, this function adds a BOS token at the beginning of the tokenization. Since I need to concatenate that with the part of the tokenization that is done based on the cache, I have to remove this additional BOS before the concatenation, and I do that by simply skipping the first token returned by common_tokenize, without verifying that it is indeed a BOS token. If my assumption that common_tokenize always returns a tokenization starting with a BOS token is incorrect, this may cause issues. To be on the safe side, the code will probably have to make this check before skipping the first token (a sketch of this check, together with the one from the previous point, follows after this list).
  6. I tested my code with chat sessions using the DeepSeek-R1-UD-IQ1_S and Qwen2.5-7B-Instruct-1M-Q4_K_M models. While I didn't experience any issues and it solved the problem, I didn't use concurrent sessions and other types of API calls that use this function, so the new logic might still have some other hidden issues.
  7. The new logic should probably be moved to a separate function.
  8. I haven't written any C or C++ code in over 20 years, so my code might not be optimal, particularly regarding memory usage.
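As a follow-up to points 4 and 5, here is a small sketch of what the extra BOS checks could look like inside the patched block above. The variable names come from the diff; the BOS accessor is an assumption on my part (something along the lines of llama_vocab_bos(ctx_server.vocab) being available in this version; the exact helper may differ):

// Sketch only, reusing the variable names from the diff above; the BOS accessor is assumed.
const llama_token bos = llama_vocab_bos(ctx_server.vocab); // assumed helper, name may differ by version

// Point 4 (inside the loop over slots): only reuse a slot cache for tokenization if its
// first token really is BOS, otherwise the "ignore the first mismatch" shortcut could
// mis-tokenize the start of the prompt.
if (!cache_tokens.empty() && cache_tokens[0] != bos) {
    continue; // skip this slot's cache
}

// Point 5 (at the concatenation): only drop the first token of the normally tokenized
// remainder if it actually is BOS.
const size_t skip = (!remaining_prompt_tokens.empty() && remaining_prompt_tokens[0] == bos) ? 1 : 0;
cache_based_tokenization.insert(cache_based_tokenization.end(),
        remaining_prompt_tokens.begin() + skip, remaining_prompt_tokens.end());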

@slaren slaren added bug Something isn't working and removed bug-unconfirmed labels Feb 21, 2025