vLLM CUDA OOM for `loglikelihood`, but not for `generate_until` #2698

lsjlsj5846 · 2025-02-14T06:48:00Z

Hello,

I keep getting CUDA OOM errors when trying to evaluate the DeepSeek-R1-Distill-Llama-8B model (downloaded locally) on the HAE-RAE benchmark.
My command is as follows:

lm_eval --model vllm --model_args "pretrained=...,max_model_len=4096,max_num_seqs=256" --batch_size auto --tasks haerae

I have to reduce `max_num_seqs` to 4 to avoid the OOM error.

However, when I evaluate the model on the KMMLU benchmark, I get no CUDA OOM at all.
My command was:

lm_eval --model vllm --model_args "pretrained=...,max_model_len=4096,max_num_seqs=256" --batch_size auto --tasks kmmlu_direct

The only difference is that the HAE-RAE benchmark uses `loglikelihood`, while KMMLU uses `generate_until`.
Can anyone explain why this is happening?

Thank you in advance.


baberabb · 2025-02-14T13:05:56Z

Hi! This is a known issue with vLLM. Setting `gpu_memory_utilization` quite low helps (the right value depends on the model, since that memory budget is shared by the KV cache and the weights). Setting `batch_size` manually also helps.
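To make the trade-off above concrete: `gpu_memory_utilization` is the fraction of GPU memory vLLM plans to use, and both the model weights and the KV cache come out of that budget. The sketch below uses shell + awk with hypothetical numbers (an 80 GiB GPU and about 15 GiB of bf16 weights for an 8B model) and ignores activation and CUDA-graph overhead. It shows that lowering the fraction shrinks the KV cache but leaves more memory outside vLLM's plan. The idea that the `loglikelihood` path needs extra memory not covered by that plan is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Back-of-envelope split of GPU memory under vLLM's gpu_memory_utilization.
# Simplification (assumption): ignores activation profiling and CUDA graphs.
# Arguments: total GPU memory (GiB), utilization fraction, weight size (GiB).
budget_report() {
  awk -v t="$1" -v u="$2" -v w="$3" 'BEGIN {
    budget = t * u        # memory vLLM plans to use (weights + KV cache)
    kv     = budget - w   # what is left over for the KV cache
    spare  = t - budget   # memory not reserved by vLLM at all
    printf "budget=%.1f kv=%.1f spare=%.1f\n", budget, kv, spare
  }'
}

# Hypothetical inputs: 80 GiB GPU, ~15 GiB of weights.
budget_report 80 0.9 15   # 0.9 is vLLM's default utilization
budget_report 80 0.7 15   # a lower setting, as suggested above
```

Applied to the original command, a lower-memory run might look like `lm_eval --model vllm --model_args "pretrained=...,max_model_len=4096,gpu_memory_utilization=0.7" --batch_size 8 --tasks haerae`. The values 0.7 and 8 are illustrative starting points, not tested settings.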

baberabb added the `asking questions` label (for asking for clarification / support on library usage) on Feb 14, 2025
