
[Accuracy gap with official model card due to wrong parsing] #2707

Open
Monstertail opened this issue Feb 17, 2025 · 1 comment

@Monstertail

I tested the accuracy of gsm8k-cot on Qwen2-7B-Instruct, whose model card reports an accuracy of 0.82. However, when I ran lm-eval-harness, there was a significant accuracy gap for both gsm8k and gsm8k_cot.

[gsm8k]

VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 lm_eval --model vllm --model_args pretrained=Qwen/Qwen2-7B-Instruct,dtype=auto --tasks gsm8k --device cuda:1 --apply_chat_template --fewshot_as_multiturn --num_fewshot 8 --gen_kwargs temperature=0 --batch_size auto --seed 123 --output_path /data/jinwei/bench_res/ --log_samples

[Image: gsm8k evaluation results]

[gsm8k-cot]

CUDA_VISIBLE_DEVICES=0 lm_eval --model sglang --model_args pretrained=Qwen/Qwen2-7B-Instruct,dtype=auto --tasks gsm8k_cot --device "cuda" --apply_chat_template --fewshot_as_multiturn --num_fewshot 8 --gen_kwargs temperature=0 --batch_size auto --seed 123 --output_path /data/jinwei/bench_res/ --log_samples

[Image: gsm8k_cot evaluation results]

I analyzed the output logs and figured out the reason: the parser cannot detect many correct answers under "exact_match". It only extracts answers written exactly in the format "The answer is x."

Some error patterns, all logged with "filtered_resps": ["[invalid]"], "filter": "strict-match", "metrics": ["exact_match"]:
...The answer is \\(366\\).
...Therefore, the answer is 23 jewels.
...Therefore, Brandon's iPhone is 8 years old.
... The answer is: $40.
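For illustration, here is a minimal Python sketch (not the harness's actual filter regexes, just an assumption about the failure mode) of why a strict "The answer is <number>." pattern marks these responses as [invalid], while a lenient last-number fallback still recovers them:

```python
import re

# Illustrative only: a strict pattern that requires the literal sentence
# "The answer is <number>." misses answers wrapped in LaTeX, units, or "$",
# while a lenient fallback that grabs the last number still recovers them.
STRICT = re.compile(r"The answer is (-?[0-9.,]+)\.")
LENIENT = re.compile(r"-?[0-9][0-9.,]*")

samples = [
    "The answer is \\(366\\).",
    "Therefore, the answer is 23 jewels.",
    "Therefore, Brandon's iPhone is 8 years old.",
    "The answer is: $40.",
]

for s in samples:
    strict = STRICT.search(s)
    nums = LENIENT.findall(s)
    strict_ans = strict.group(1).rstrip(".,") if strict else "[invalid]"
    lenient_ans = nums[-1].rstrip(".,") if nums else "[invalid]"
    print(f"strict={strict_ans:10} lenient={lenient_ans:6} | {s}")
```

All four samples come out "[invalid]" under the strict pattern even though the correct number is clearly present in the text.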

Therefore, I modified the prompt here by simply appending: (Please summarize the result at the end in the format "The answer is xxx", where xxx is the result.)
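As a rough sketch of the idea (the actual change edits the gsm8k_cot prompt template inside lm-eval-harness; the helper below is hypothetical and not part of the harness), the formatting instruction is just appended to each question:

```python
# Hypothetical illustration; the real fix modifies the task's prompt
# template rather than wrapping questions in user code.
FORMAT_HINT = (
    ' Please summarize the result at the end in the format '
    '"The answer is xxx", where xxx is the result.'
)

def build_prompt(question: str) -> str:
    """Append an explicit output-format instruction to a question."""
    return question + FORMAT_HINT

print(build_prompt("Q: If there are 3 cars and each car has 4 wheels, "
                   "how many wheels are there in total?"))
```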

It works pretty well once the model is told the output format!
The strict-match accuracy rose from 0.57 to 0.80, much closer to the model card :)
[Image: gsm8k_cot results after the prompt change]

Will you add this formatting instruction to more tasks if possible? I believe it can bridge the gap with the model cards on HF :)

@Qubitium
Contributor

Qubitium commented Feb 17, 2025

@Monstertail I agree. I think you hit an important bug here. This would explain why some quantized models score higher than even the native models on some tests, since the calibration data used in quantization (GPTQModel, for example) may align the output toward a more structured format such as "The answer is".

