the result of Ultravox-v0.5-LLaMA-3.1-8B #11

Open
Gpwner opened this issue Mar 4, 2025 · 10 comments

Gpwner commented Mar 4, 2025

I have tested Ultravox-v0.5-LLaMA-3.1-8B too, but my test results are slightly different from yours, especially on the SD-QA dataset.

| | AlpacaEval | CommonEval | SD-QA | OpenBookQA | IFEval | AdvBench |
| --- | --- | --- | --- | --- | --- | --- |
| Task | Open-Ended QA | Open-Ended QA | Reference-Based QA | Multiple-Choice QA | Instruction Following | Safety |
| Samples | 199 | 200 | 553 | 455 | 345 | 520 |
| Ultravox-v0.5 LLaMA-3.1-8B Instruct | 4.75 | 4.08 | 72.42 | 69.01 | 68.05 | 98.84 |



Gpwner commented Mar 4, 2025

The last number for SD-QA is (panda + gpt) / 2, right?

MatthewCYM (Owner) commented

Hi, I just noticed that using qa_metrics==0.2.17 (as specified in the requirements) results in a Panda score of 47.74, which aligns with the current performance reported on the leaderboard. However, when using the latest version, qa_metrics==0.2.30, the Panda score unexpectedly jumps to 74.50. I'm currently investigating this discrepancy.
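
For anyone reproducing this, a minimal sketch of how the Panda component can be recomputed under a pinned qa_metrics version. The PEDANT import path and the evaluate(references, candidate, question) call are taken from the VoiceBench traceback quoted later in this thread; the item keys ('reference', 'response', 'prompt') and the boolean return value are assumptions, not verified against the qa_metrics source.

```python
# Sketch: recompute the Panda match rate over the SD-QA results under
# whichever qa_metrics version is installed (e.g. 0.2.17 vs 0.2.30).
from qa_metrics.pedant import PEDANT  # import path assumed from the traceback below


def panda_score(data):
    """Mean Panda match rate (in %) over SD-QA items.

    `data` is assumed to be a list of dicts with 'reference',
    'response', and 'prompt' keys, mirroring src/evaluator/qa.py.
    """
    pedant = PEDANT()
    matches = [
        pedant.evaluate(
            [item['reference'].lower()],
            item['response'].lower(),
            item['prompt'].lower(),
        )
        for item in data
    ]
    # evaluate() is assumed to return a boolean match decision.
    return sum(matches) / len(matches) * 100
```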

MatthewCYM (Owner) commented

Could you please check if you observe the same discrepancy?


Gpwner commented Mar 13, 2025

I am using qa_metrics==0.2.24. When I install qa_metrics==0.2.17 and run the evaluator, I get this error:

    panda_results = [self.pedant.evaluate([item['reference'].lower()], item['response'].lower(), item['prompt'].lower()) for item in data]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/extendisk1/usr/bencheval/VoiceBench/src/evaluator/qa.py", line 22, in <listcomp>
    panda_results = [self.pedant.evaluate([item['reference'].lower()], item['response'].lower(), item['prompt'].lower()) for item in data]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda3/envs/vllm/lib/python3.11/site-packages/qa_metrics/pedant.py", line 267, in evaluate
    output = self.model.predict(result)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda3/envs/vllm/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 451, in predict
    scores = self.decision_function(X)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda3/envs/vllm/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 432, in decision_function
    X = self._validate_data(X, accept_sparse="csr", reset=False)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda3/envs/vllm/lib/python3.11/site-packages/sklearn/base.py", line 626, in _validate_data
    self._check_n_features(X, reset=reset)
  File "/home/usr/miniconda3/envs/vllm/lib/python3.11/site-packages/sklearn/base.py", line 415, in _check_n_features
    raise ValueError(
ValueError: X has 82735 features, but SGDClassifier is expecting 82768 features as input.
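
A small diagnostic sketch for this error, assuming the PEDANT object keeps its sklearn classifier on a `model` attribute as the traceback suggests; `n_features_in_` is the standard sklearn attribute for the expected input width. A mismatch between the vectorized input and this number usually points at featurizer/classifier files coming from different package releases.

```python
# Diagnostic sketch: print the installed qa_metrics version and the feature
# width the bundled SGDClassifier expects. "X has 82735 features, but
# SGDClassifier is expecting 82768" means the text featurizer and the
# classifier were not produced by the same release.
from importlib.metadata import version
from qa_metrics.pedant import PEDANT  # import path assumed from the traceback

print("qa_metrics version:", version("qa_metrics"))

pedant = PEDANT()
# `model` is an internal attribute visible in the traceback and may change
# between releases; n_features_in_ is standard on fitted sklearn estimators.
print("classifier expects", pedant.model.n_features_in_, "input features")
```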


Gpwner commented Mar 13, 2025

By the way, when I evaluate IFEval, is the final score calculated as (strict-prompt + strict-instruction + loose-prompt + loose-instruction) / 4? Thanks.

MatthewCYM (Owner) commented

Yes, it's calculated as (strict-prompt + strict-instruction + loose-prompt + loose-instruction) / 4.
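
For reference, the aggregation in code (a trivial sketch; the sub-scores in the usage line are hypothetical):

```python
# IFEval final score: the plain mean of the four sub-scores, as confirmed above.
def ifeval_score(strict_prompt, strict_instruction, loose_prompt, loose_instruction):
    return (strict_prompt + strict_instruction + loose_prompt + loose_instruction) / 4


# Hypothetical sub-scores, for illustration only:
print(ifeval_score(0.70, 0.78, 0.72, 0.80))  # 0.75
```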

MatthewCYM (Owner) commented

According to zli12321/qa_metrics#2, the panda score calculation method changed in version 0.2.17 and later, explaining the observed score discrepancy.

zli12321 commented

The Panda score is tuned to be more correlated with GPT-4 in version 0.2.17 and later (0.2.30 is currently the latest). The jump in score is because Panda gives a more relaxed match than the older version.
The old Panda accuracy is 0.47 while GPT-4 is 0.68; the new Panda accuracy is 0.76, which is closer to GPT-4's.


Gpwner commented Mar 14, 2025

So, will the leaderboard be updated to use the latest qa_metrics==0.2.30? @MatthewCYM

zli12321 commented

The correlation is closer to LLM evaluation, but since the score is averaged with the GPT score, the overall rankings on the existing leaderboard will mostly remain the same.
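
To make that concrete, a small arithmetic sketch: with SD-QA reported as (panda + gpt) / 2, a roughly uniform jump in the Panda component shifts every model's SD-QA score by about the same amount, so relative rankings move little. The GPT value below is hypothetical; only the two Panda numbers come from this thread.

```python
# SD-QA combines the two judges as (panda + gpt) / 2.
def sdqa_score(panda, gpt):
    return (panda + gpt) / 2


gpt = 68.0                    # hypothetical GPT-based score, not from this thread
old = sdqa_score(47.74, gpt)  # Panda under qa_metrics==0.2.17 -> 57.87
new = sdqa_score(74.50, gpt)  # Panda under qa_metrics==0.2.30 -> 71.25
print(round(new - old, 2))    # +13.38; a similar offset for every model leaves rankings mostly intact
```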
